Noise-Aware Differentially Private Regression via Meta-Learning

Ossi Räisä, University of Helsinki, ossi.raisa@helsinki.fi
Stratis Markou, University of Cambridge, em626@cam.ac.uk
Matthew Ashman, University of Cambridge, mca39@cam.ac.uk
Wessel P. Bruinsma, Microsoft Research AI for Science, wessel.p.bruinsma@gmail.com
Marlon Tobaben, University of Helsinki, marlon.tobaben@helsinki.fi
Antti Honkela, University of Helsinki, antti.honkela@helsinki.fi
Richard E. Turner, University of Cambridge, ret26@cam.ac.uk

Abstract

Many high-stakes applications require machine learning models that protect user privacy and provide well-calibrated, accurate predictions. While Differential Privacy (DP) is the gold standard for protecting user privacy, standard DP mechanisms typically impair performance significantly. One approach to mitigating this issue is pre-training models on simulated data before DP learning on the private data. In this work we go a step further, using simulated data to train a meta-learning model that combines the Convolutional Conditional Neural Process (Conv CNP) with an improved functional DP mechanism of Hall et al. [2013], yielding the DPConv CNP. The DPConv CNP learns from simulated data how to map private data to a DP predictive model in a single forward pass, and then provides accurate, well-calibrated predictions. We compare the DPConv CNP with a DP Gaussian Process (GP) baseline with carefully tuned hyperparameters. The DPConv CNP outperforms the GP baseline, especially on non-Gaussian data, yet is much faster at test time and requires less tuning.

1 Introduction

Deep learning has achieved tremendous success across a range of domains, especially in settings where large datasets are publicly available. However, in many impactful applications such as healthcare, the data may contain sensitive information about users, whose privacy we want to protect. Differential Privacy [DP; Dwork et al., 2006] is the gold standard framework for protecting user privacy, as it provides strong guarantees on the privacy loss incurred by users participating in a dataset. However, enforcing DP often significantly impairs performance. A recently proposed method to mitigate this issue is to pre-train a model on non-private data, e.g. from a simulator [Tang et al., 2023], and then fine-tune it under DP on real private data [Yu et al., 2021, Li et al., 2022, De et al., 2022]. We go a step further and train a meta-learning model with a DP mechanism inside it (Figure 1). While supervised learning is about learning a mapping from inputs to outputs using a learning algorithm, in meta-learning we learn a learning algorithm directly from the data, by meta-training, enabling generalisation to new datasets during meta-testing.

Figure 1: Meta-training (left) and meta-testing (right) using our method. We train a model on multiple tasks with non-private (simulated or proxy) data to predict on target (t) points using the context (c) points.
Crucially, by including a DP mechanism, which clips and adds noise to the data during training, the parameter updates (dashed arrow) teach the model to make well-calibrated and accurate predictions in the presence of DP noise. At test time, we deploy the model on real data using the same mechanism, which protects the context set with DP guarantees.

Our model is meta-trained on simulated datasets, each of which is split into a context and target set, learning to make predictions at the target inputs given the context set. At meta-test time, the model takes a context set of real data, which is protected by the DP mechanism, and produces noise-aware predictions and accurate uncertainty estimates.

Neural Processes. Our method is based on neural processes [NPs; Garnelo et al., 2018a], a model which leverages the flexibility of neural networks to produce well-calibrated predictions in the meta-learning setting. The parameters of the NP are meta-trained to generalise to unseen datasets, while adapting to new contexts much faster than gradient-based fine-tuning alternatives [Finn et al., 2017].

Convolutional NPs. We focus on convolutional conditional NPs [Conv CNPs; Gordon et al., 2020], a type of NP that has remarkably strong performance in challenging regression problems. That is because the Conv CNP is translation equivariant [TE; Cohen and Welling, 2016], so its outputs change predictably whenever the input data are translated. This is an extremely useful inductive bias when modelling, for example, stationary data. The Conv CNP architecture also makes it natural to embed an especially effective DP mechanism inside it, using the functional mechanism [Hall et al., 2013] to protect the privacy of the context set (Figure 1). We call the resulting model the DPConv CNP.

Training with a DP mechanism. A crucial aspect of our approach is training the DPConv CNP on non-sensitive data with the DP mechanism in the training loop. The mechanism involves clipping and adding noise, so applying it only during testing would create a mismatch between training and testing. Training with the mechanism eliminates this mismatch, ensuring calibrated predictions (Figure 2).

Overview of contributions. In summary, our main contributions in this work are as follows.

1. We introduce the DPConv CNP, a meta-learning model which extends the Conv CNP using the functional DP mechanism [Hall et al., 2013]. The model is meta-trained with the mechanism in place, learning to make calibrated predictions from the context data under DP.

2. We improve upon the functional mechanism of Hall et al. [2013] by leveraging Gaussian DP theory [Dong et al., 2022], showing that context set privacy can be protected with smaller amounts of noise (at least 25% lower standard deviation in the settings considered in Figure 4). We incorporate these improvements into the DPConv CNP, but note that they are also of interest in any use case of the functional mechanism.

3. We conduct a study on synthetic and sim-to-real tasks. Remarkably, even with relatively few context points (a few hundred) and modest privacy budgets, the predictions of the DPConv CNP are surprisingly close to those of the non-DP optimal Bayes predictor. Further, we find that a single DPConv CNP can be trained to generalise across generative processes with different statistics and privacy budgets. We also evaluate the DPConv CNP by training it on synthetic data, and testing it on a real dataset in the small data regime.
In all cases, the DPConv CNP produces well-calibrated predictions, and is competitive with a carefully tuned DP Gaussian process baseline.

Figure 2 (panels: ϵ = 1.00, δ = 0.001, N = 256; ϵ = 1.00, δ = 0.001, N = 512; ϵ = 3.00, δ = 0.001, N = 256): Training our proposed model with a DP mechanism inside it enables the model to make accurate, well-calibrated predictions, even for modest privacy budgets and dataset sizes. Here, the context data (black) are protected with different (ϵ, δ) DP budgets as indicated. The model makes predictions (blue) that are remarkably close to the optimal (non-private) Bayes predictor.

2 Related Work

Training deep learning models on public proxy datasets and then fine-tuning with DP on private data is becoming increasingly common in computer vision and natural language processing applications [Yu et al., 2021, Li et al., 2022, De et al., 2022, Tobaben et al., 2023]. However, these approaches rely on the availability of very large non-sensitive datasets. Because these datasets would likely need to be scraped from the internet, it is unclear whether they are actually non-sensitive [Tramèr et al., 2024]. On the other hand, other approaches study meta-learning with DP during meta-training [Li et al., 2020, Zhou and Bassily, 2022], but do not enforce privacy guarantees at meta-test time. Our approach fills a gap in the literature by enforcing privacy of the meta-test data with DP guarantees (see Figure 1), and using non-sensitive proxy data during meta-training. Unlike other approaches which rely on large fine-tuning datasets, our method produces well-calibrated predictions even for relatively small datasets (a few hundred datapoints). In this respect, the work of Smith et al. [2018], who study Gaussian process (GP) regression under DP for the small data regime, is perhaps most similar to ours. However, Smith et al. [2018] enforce privacy constraints only with respect to the output variables and do not protect the input variables, whereas our approach protects both.

In terms of theory, there is fairly limited prior work on releasing functions with DP guarantees. Our method is based on the functional DP mechanism of Hall et al. [2013], which works by adding noise from a GP to a function to be released. This approach works especially well when the function lies in a reproducing kernel Hilbert space (RKHS), a property which we leverage in the DPConv CNP. We improve on the original functional mechanism by leveraging the Gaussian DP theory of Dong et al. [2022]. In related work, Aldà and Rubinstein [2017] develop the Bernstein DP mechanism, which adds noise to the coefficients of the Bernstein polynomial of the released function, and Mirshani et al. [2019] generalise the functional mechanism beyond RKHSs. Jiang et al. [2023] derive Rényi differential privacy [RDP; Mironov, 2017] bounds for the mechanism of Hall et al. [2013].

3 Background

We start by laying out the necessary background. In Section 3.1, we outline meta-learning and NPs, focusing on the Conv CNP. In Section 3.2 we introduce DP, and the functional mechanism of Hall et al. [2013]. We keep the discussion on DP lightweight, deferring technical details to Appendix A.

3.1 Meta-learning and Neural Processes

Supervised learning. Let D be the set of datasets consisting of (x, y)-pairs with x ∈ X ⊆ R^d and y ∈ Y ⊆ R.
The goal of supervised learning is to use a dataset D ∈ D to learn appropriate parameters θ for a conditional distribution p(y|x, θ), which maximise the predictive log-likelihood on unseen, randomly sampled test pairs (x*, y*), i.e. L(θ, (x*, y*)) = log p(y*|x*, θ). Let us denote the entire algorithm that performs learning, followed by prediction, by π, that is π(x*, D) = p(·|x*, θ*), where θ* = arg max_θ L(θ, D). Supervised learning is concerned with designing a hand-crafted π, e.g. picking an appropriate architecture and optimiser, which is trained on a single dataset D.

Meta-learning. Meta-learning can be regarded as supervised learning of the function π itself. In this setting, D is regarded as part of a single training example, which means that a meta-learning algorithm can handle different D at test time. Concretely, in meta-learning, we have π_{θ,ϕ}(x*, D) = p(·|x*, θ, r_ϕ(D)), where r_ϕ is now a function that produces task-specific parameters, adapted for D. The meta-training set now consists of a collection of datasets (D_m)_{m=1}^M, often referred to as tasks. Each task is partitioned into a context set D^(c) = (x^(c), y^(c)) and a target set D^(t) = (x^(t), y^(t)). We refer to x^(c) and y^(c) as the context inputs and outputs, and to x^(t) and y^(t) as the target inputs and outputs. To meta-train a meta-learning model, we optimise its predictive log-likelihood, averaged over tasks, i.e. E_D[L(θ, ϕ, D)] = E_D[log π_{θ,ϕ}(x^(t), D^(c))(y^(t))]. Meta-learning algorithms are broadly categorised into two groups, based on the choice of r_ϕ [Bronskill, 2020].

Gradient-based vs. amortised meta-learning. On one hand, gradient-based methods, such as MAML [Finn et al., 2017] and its variants (e.g. [Nichol et al., 2018]), rely on gradient-based fine-tuning at test time. Concretely, these let r_ϕ be a function that performs gradient-based optimisation. For such algorithms, we can enforce DP with respect to a meta-test time dataset by fine-tuning with a DP optimisation algorithm, such as DP-SGD [Abadi et al., 2016]. While generally effective, such approaches can require significant resources for fine-tuning at meta-test time, as well as careful DP hyperparameter tuning to work at all. On the other hand, there are amortised methods, such as neural processes [Garnelo et al., 2018a], prototypical networks [Snell et al., 2017], and matching networks [Vinyals et al., 2016], in which r_ϕ is a learnable function, such as a neural network. This approach has the advantage that it requires far less compute and memory at meta-test time. In this work, we focus on neural processes (NPs), and show how r_ϕ can be augmented with a DP mechanism to make well-calibrated predictions, while protecting the context data at meta-test time.

Neural Processes. Neural processes (NPs) are a type of model which leverage the flexibility of neural networks to produce well-calibrated predictions. A range of NP variants have been developed, including conditional NPs [CNPs; Garnelo et al., 2018a], latent-variable NPs [LNPs; Garnelo et al., 2018b], Gaussian NPs [GNPs; Markou et al., 2022], score-based NPs [Dutordoir et al., 2023], and autoregressive NPs [Bruinsma et al., 2023]. In this work, we focus on CNPs because these are ideally suited for our purposes, but our framework can be extended to other variants. A CNP consists of an encoder enc_ϕ and a decoder dec_θ. The encoder is a neural network which ingests a context set D^(c) ∈ D and outputs a representation r in some representation space R.
Two concrete examples of such encoders are Deep Sets [Zaheer et al., 2017] and Set Conv layers [Gordon et al., 2020]. The decoder is another neural network, with parameters θ, which takes the representation r together with target inputs x^(t) and produces predictions for the corresponding y^(t). In summary,

π_{ϕ,θ}(x^(t), D^(c)) = dec_θ(x^(t), r),   r = enc_ϕ(D^(c)).   (1)

In CNPs, a standard choice, which we also use here, is to let π_{ϕ,θ}(x^(t), D^(c)) return a mean µ_{ϕ,θ}(x^(t), D^(c)) and a variance σ²_{ϕ,θ}(x^(t), D^(c)), to parameterise a predictive distribution that factorises across the target points, y^(t)|x^(t) ∼ N(µ_{ϕ,θ}(x^(t), D^(c)), σ²_{ϕ,θ}(x^(t), D^(c))). We note that our framework straightforwardly extends to more complicated π_{ϕ,θ}(x^(t), D^(c)). To train a CNP to make accurate predictions, we can optimise a log-likelihood objective [Garnelo et al., 2018a] such as

L(θ, ϕ) = E_D[ Σ_m log N(y^(t)_m | µ_{ϕ,θ}(x^(t)_m, D^(c)), σ²_{ϕ,θ}(x^(t)_m, D^(c))) ],   (2)

where the expectation is taken over the distribution over tasks D. This objective is optimised by presenting each task D_m to the CNP, computing the gradient of the loss with back-propagation, and updating the parameters (ϕ, θ) of the CNP with any first-order optimiser (see Algorithm 1). This process trains the CNP to make well-calibrated predictions for D^(t) given D^(c). At test time, given a new D^(c), we can use π_{ϕ,θ}, which can be queried at arbitrary target inputs, to obtain corresponding predictions (Algorithm 2).

Algorithm 1 Meta-training a neural process.
Input: Simulated datasets (D_m)_{m=1}^M, encoder enc_ϕ, decoder dec_θ, iterations T, optimiser opt
Output: Optimised parameters ϕ, θ
for i ∈ {1, . . . , T} do
    Choose D from (D_m)_{m=1}^M randomly
    D^(c), D^(t) ← D
    x^(t), y^(t) ← D^(t)
    µ, σ² ← dec_θ(x^(t), enc_ϕ(D^(c)))
    L(θ, ϕ) ← log N(y^(t) | µ, σ²)
    ϕ, θ ← opt(ϕ, θ, ∇_{ϕ,θ} L)
end for
Return ϕ, θ

Algorithm 2 Meta-testing a neural process.
Input: Real context D^(c), enc_ϕ, dec_θ
Output: Predictive µ, σ, with domain X
µ(·), σ(·) ← dec_θ(·, enc_ϕ(D^(c)))
Return µ, σ

Convolutional CNPs. Whenever we have useful inductive biases or other prior knowledge, we can leverage these by building them directly into the encoder and the decoder of the CNP. Stationarity is a powerful inductive bias that is often encountered in potentially sensitive applications such as time series or spatio-temporal regression. Whenever the generating process is stationary, the corresponding Bayesian predictive posterior is TE [Foong et al., 2020]. Conv CNPs leverage this inductive bias using TE architectures [Cohen and Welling, 2016, Huang et al., 2023].

Conv CNP encoder. To achieve TE, the Conv CNP encoder produces an r that is itself a TE function.

Figure 3: Left; Illustration of the Conv CNP encoder enc_ϕ. Black crosses show an example context set D^(c). The density channel r^(d) is shown in purple and the signal channel r^(s) is shown in red. The representation r consists of concatenating r^(d) and r^(s). Right; Illustration of the DPConv CNP encoder. Black crosses show an example context D^(c), clipped with a threshold C (gray dashed). Here, a single point (rightmost) is clipped (gray cross shows value before clipping). The density and signal channels are computed and GP noise is added to obtain the DP representation (red & purple).

Specifically, enc_ϕ maps the context D^(c) = ((x^(c)_n, y^(c)_n))_{n=1}^N to the function r : X → R²,

r(x) = (r^(d)(x), r^(s)(x)),   r^(d)(x) = Σ_{n=1}^N ψ((x − x^(c)_n)/λ),   r^(s)(x) = Σ_{n=1}^N y^(c)_n ψ((x − x^(c)_n)/λ),   (3)

where ψ is the Gaussian radial basis function (RBF) and ϕ = {λ}.
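To make the Set Conv encoder of Eq. (3) concrete, the following is a minimal NumPy sketch of the density and signal channels for a one-dimensional context set evaluated on a regular grid. It is our own illustration (the function names and toy data are ours), not the authors' released implementation.

```python
import numpy as np

def rbf(dist, lengthscale):
    # Gaussian RBF psi, bounded above by 1.
    return np.exp(-0.5 * (dist / lengthscale) ** 2)

def setconv_encode(x_ctx, y_ctx, x_grid, lengthscale):
    """Density and signal channels of Eq. (3), evaluated on a grid.

    x_ctx: (N,) context inputs, y_ctx: (N,) context outputs, x_grid: (D,) grid points.
    Returns arrays r_d, r_s of shape (D,).
    """
    # Pairwise differences between grid points and context inputs: shape (D, N).
    diffs = x_grid[:, None] - x_ctx[None, :]
    weights = rbf(diffs, lengthscale)
    r_d = weights.sum(axis=1)                      # density channel
    r_s = (weights * y_ctx[None, :]).sum(axis=1)   # signal channel
    return r_d, r_s

# Toy usage: a small context set and a grid of 100 points on [-2, 2].
x_ctx = np.array([-1.0, 0.3, 1.2])
y_ctx = np.array([0.5, -0.2, 1.0])
x_grid = np.linspace(-2.0, 2.0, 100)
r_d, r_s = setconv_encode(x_ctx, y_ctx, x_grid, lengthscale=0.2)
```

The stacked channels (r_d, r_s) correspond to the discretised representation that the decoder's CNN later consumes.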
We refer to the two channels of r as the density r^(d) and the signal r^(s) channels, which can be viewed as a smoothed version of D^(c). The density channel carries information about the inputs of the context data, while the signal channel carries information about the outputs. This encoder is referred to as the Set Conv.

Conv CNP decoder. Once r has been computed, it is passed to the decoder, which performs three steps. First, it discretises r using a pre-specified resolution. Then, it applies a CNN to the discretised signal, and finally it uses an RBF smoother akin to Equation (3) to make predictions at arbitrary target locations. The aforementioned steps are all TE, so composing them with the TE encoder produces a TE prediction map [Bronstein et al., 2021]. The Conv CNP has universal approximator properties and produces state-of-the-art, well-calibrated predictions [Gordon et al., 2020].

3.2 Differential Privacy

Differential privacy [Dwork et al., 2006, Dwork and Roth, 2014] quantifies the maximal privacy loss to data subjects that can occur when the results of an analysis are released. The loss is quantified by two numbers, ϵ and δ, which bound the change in the distribution of the output of an algorithm when the data of a single data subject in the dataset change.

Definition 3.1. An algorithm M is (ϵ, δ)-DP if for neighbouring D, D′ and all measurable sets S,

Pr(M(D) ∈ S) ≤ e^ϵ Pr(M(D′) ∈ S) + δ.   (4)

We consider D ∈ R^{N×d} with N users and d dimensions, and use the substitution neighbourhood relation ∼_S, where D ∼_S D′ if D and D′ differ by at most one row.

Gaussian DP. In Section 3.3 we discuss the functional mechanism of Hall et al. [2013], which we use in the Conv CNP. However, the original privacy guarantees derived by Hall et al. [2013] are suboptimal. We improve upon these using the framework of Gaussian DP [GDP; Dong et al., 2022]. Dong et al. [2022] define GDP from a hypothesis testing perspective, which is not necessary for our purposes. Instead, we present GDP through the following conversion formula between GDP and DP.

Definition 3.2. A mechanism M is µ-GDP if and only if it is (ϵ, δ(ϵ))-DP for all ϵ ≥ 0, where

δ(ϵ) = Φ(−ϵ/µ + µ/2) − e^ϵ Φ(−ϵ/µ − µ/2),   (5)

and Φ is the CDF of the standard Gaussian distribution.

Properties of (G)DP. Differential privacy has several useful properties. First, post-processing immunity guarantees that post-processing the result of a DP algorithm does not cause additional privacy loss:

Theorem 3.3 (Dwork and Roth 2014). Let M be an (ϵ, δ)-DP (or µ-GDP) algorithm and let f be any, possibly randomised, function. Then f ∘ M is (ϵ, δ)-DP (or µ-GDP).

Composition of DP mechanisms refers to running multiple mechanisms on the same data. When each mechanism can depend on the outputs of the previous mechanisms, the composition is called adaptive. GDP is particularly appealing because it has a simple and tight composition formula:

Theorem 3.4 (Dong et al. 2022). The adaptive composition of T mechanisms that are µ_i-GDP (i = 1, . . . , T) is µ-GDP with µ = √(µ₁² + ⋯ + µ_T²).

Gaussian mechanism. One of the central mechanisms used to guarantee DP is the Gaussian mechanism. This releases the output of a function f with added Gaussian noise,

M(D) = f(D) + N(0, σ²I),   (6)

where the variance σ² depends on the ℓ₂-sensitivity of f, defined as

Δ = sup_{D ∼ D′} ||f(D) − f(D′)||₂.   (7)

Theorem 3.5 (Dong et al. 2022). The Gaussian mechanism with variance σ² = Δ²/µ² is µ-GDP.
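As an illustration of Definition 3.2, the sketch below evaluates the conversion curve δ(ϵ) of a µ-GDP mechanism and numerically inverts it to recover the µ implied by a given (ϵ, δ) budget, which is the quantity needed when calibrating noise in Section 4. This is a minimal sketch with our own function names, not part of the paper's codebase.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def delta_from_mu(eps: float, mu: float) -> float:
    """delta(eps) for a mu-GDP mechanism (Definition 3.2, Eq. 5)."""
    return norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2)

def mu_from_eps_delta(eps: float, delta: float) -> float:
    """Smallest mu such that a mu-GDP mechanism satisfies (eps, delta)-DP.

    delta_from_mu(eps, mu) increases with mu, so a bracketing root-finder suffices.
    """
    return brentq(lambda mu: delta_from_mu(eps, mu) - delta, 1e-6, 100.0)

if __name__ == "__main__":
    eps, delta = 1.0, 1e-3
    mu = mu_from_eps_delta(eps, delta)
    print(f"(eps={eps}, delta={delta}) corresponds to mu = {mu:.3f}")
    # Sanity check: converting back recovers delta.
    print(f"delta_from_mu = {delta_from_mu(eps, mu):.2e}")
```

The same numerical inversion is what a DP accounting routine can use to translate a user-specified (ϵ, δ) budget into the µ-budget consumed by the mechanisms below.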
3.3 The Functional Mechanism

Now we turn to the functional mechanism of Hall et al. [2013]. Given a dataset D ∈ R^{N×d}, the functional mechanism releases a function f_D : T → R, where T ⊆ R^d, with added noise from a Gaussian process. For simplicity, here we only define the functional mechanism for functions in a reproducing kernel Hilbert space (RKHS), and defer the more general definition to Appendix A.2.

Definition 3.6. Let g be a sample path of a Gaussian process having mean zero and covariance function k, and let H be an RKHS with kernel k. Let {f_D : D ∈ D} ⊆ H be a family of functions indexed by datasets, satisfying

Δ_H f := sup_{D ∼ D′} ||f_D − f_{D′}||_H ≤ Δ.   (8)

The functional mechanism with multiplier c and sensitivity Δ is defined as

M(D) = f_D + c g.   (9)

Theorem 3.7 (Hall et al.). If ϵ ≤ 1, the mechanism in Def. 3.6 with c = Δ √(2 ln(2/δ)) / ϵ is (ϵ, δ)-DP.

4 Differential privacy for the Conv CNP

Now we turn to our main contributions. First, we tighten the functional mechanism privacy analysis in Section 4.1, and then we build the functional mechanism into the Conv CNP in Section 4.2.

4.1 Improving the Functional Mechanism

The privacy bounds given by Theorem 3.7 are suboptimal, and do not allow us to use the tight composition formula from Theorem 3.4. However, the proof of Theorem 3.7 builds on the classical Gaussian mechanism privacy bounds, which we can replace with the GDP theory from Section 3.2. As demonstrated in Figure 4, our bound offers significantly smaller ϵ for the same noise standard deviation, compared to the existing bounds of Hall et al. [2013] and Jiang et al. [2023].

Theorem 4.1. The functional mechanism with sensitivity Δ and multiplier c = Δ/µ is µ-GDP.

Proof. The proof of Theorem 3.7 from Hall et al. [2013] shows that any (ϵ, δ)-DP bound for the Gaussian mechanism carries over to the functional mechanism. Replacing the classical Gaussian mechanism bound with the GDP bound proves the claim. For details, see Appendix A.

Algorithm 3 DPSet Conv; modifications to the original Set Conv layer shown in blue.
Input: Grid x ∈ R^D, D^(c), (ϵ, δ), RBF covariance k with scale λ, threshold C, DP accounting method noise_scales.
Output: DP representation of r^(d), r^(s).
    y^(c)_n ← clip(y^(c)_n, C) for n = 1, . . . , N
    g_d, g_s ∼ GP(0, k)
    g_d, g_s ← g_d(x), g_s(x)
    σ_d, σ_s ← noise_scales(ϵ, δ, C)
    r^(d) ← Σ_{n=1}^N ψ((x − x^(c)_n)/λ) + σ_d g_d
    r^(s) ← Σ_{n=1}^N y^(c)_n ψ((x − x^(c)_n)/λ) + σ_s g_s
Return: Density and signal r^(d), r^(s).

Figure 4: Noise magnitude comparison for the classical functional mechanism of Hall et al. [2013], the RDP-based mechanism of Jiang et al. [2023] and our improved GDP-based mechanism. The line for Hall et al. cuts off at ϵ = 1 since their bound has only been proven for ϵ ≤ 1. We set Δ² = 10 and δ = 10⁻³, which are representative values from our experiments. See Appendix A.6 for more details.
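To illustrate where the savings in Figure 4 come from, the sketch below compares the noise multiplier of the classical bound (Theorem 3.7) with the one implied by Theorem 4.1 for the same (ϵ, δ) target, using Δ² = 10 as in the figure. It repeats the GDP inversion helper from the earlier sketch; all names are ours and the comparison is only indicative.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def mu_from_eps_delta(eps, delta):
    # Invert the GDP-to-DP conversion of Definition 3.2 (see the earlier sketch).
    f = lambda mu: norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2) - delta
    return brentq(f, 1e-6, 100.0)

def classical_multiplier(eps, delta, sensitivity):
    # Theorem 3.7 (Hall et al.): c = Delta * sqrt(2 ln(2/delta)) / eps, proven for eps <= 1.
    return sensitivity * np.sqrt(2.0 * np.log(2.0 / delta)) / eps

def gdp_multiplier(eps, delta, sensitivity):
    # Theorem 4.1: c = Delta / mu.
    return sensitivity / mu_from_eps_delta(eps, delta)

eps, delta = 1.0, 1e-3
sens = np.sqrt(10.0)  # Delta^2 = 10, the representative value used in Figure 4
c_cl = classical_multiplier(eps, delta, sens)
c_gdp = gdp_multiplier(eps, delta, sens)
print(f"classical c = {c_cl:.2f}, GDP c = {c_gdp:.2f}, "
      f"reduction = {100 * (1 - c_gdp / c_cl):.0f}%")
```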
4.2 Differentially Private Convolutional CNP

Differentially Private Set Conv. Now we turn to building the functional DP mechanism into the Conv CNP. We want to modify the Set Conv encoder (Eq. 3) to make it DP. As a reminder, the Set Conv outputs the density r^(d) and signal r^(s) channels,

r^(d)(x) = Σ_{n=1}^N ψ((x − x^(c)_n)/λ),   r^(s)(x) = Σ_{n=1}^N y^(c)_n ψ((x − x^(c)_n)/λ),   (10)

which are the two quantities we want to release under DP. To achieve this, we must first determine the sensitivity of r^(d) and r^(s), as defined in Eq. 8. Recall that we use the substitution neighbourhood relation ∼_S, defined as D^(c)_1 ∼_S D^(c)_2 if D^(c)_1 and D^(c)_2 differ in at most one row, i.e. by a single context point. Since the RBF ψ is bounded above by 1, it can be shown (see Appendix A.4) that the squared RKHS sensitivity of r^(d) is bounded above by 2, and this bound is tight. Unfortunately, however, since the signal channel r^(s) depends linearly on each y^(c)_n (see Eq. 10), its sensitivity is unbounded. To address this, we clip each y^(c)_n by a threshold C, which is a standard way to ensure the sensitivity is bounded. With this modification we obtain the following tight sensitivities for r^(d) and r^(s):

Δ²_H r^(d) = 2,   Δ²_H r^(s) = 4C².   (11)

With these in place, we can state our privacy guarantee, which forms the basis of the DPConv CNP. Post-processing immunity (Theorem 3.3) ensures that post-processing r^(s) and r^(d) with the Conv CNP decoder does not result in further privacy loss.

Theorem 4.2. Let g_d and g_s be sample paths of two independent Gaussian processes having zero mean and covariance function k, such that 0 ≤ k ≤ C_k for some C_k > 0. Let Δ²_d = 2C_k and Δ²_s = 4C²C_k. Then releasing r^(d) + σ_d g_d and r^(s) + σ_s g_s is µ-GDP with µ = √(Δ²_s/σ²_s + Δ²_d/σ²_d).

Proof. The result follows by starting from the GDP bound of the mechanism in Theorem 4.1 and using Theorem 3.4 to combine the privacy costs for the releases of r^(d) and r^(s).

Corollary 4.3. Algorithm 2 with the DPSet Conv encoder from Algorithm 3 is (ϵ, δ)-DP with respect to the real context set D^(c).

Proof. The noise_scales method in Algorithm 3 computes the appropriate σ_d and σ_s values from Theorem 4.2 and Definition 3.2 such that releasing the functional encodings r^(d) + σ_d g_d and r^(s) + σ_s g_s is (ϵ, δ)-DP. The (ϵ, δ)-DP guarantee extends [Hall et al., 2013, Proposition 5] to the point evaluations of r^(d) and r^(s) over the grid x in Algorithm 3. Post-processing immunity (Theorem 3.3) extends (ϵ, δ)-DP to Algorithm 2.

4.3 Training the DPConv CNP

Training loss and algorithm. We meta-train the DPConv CNP parameters θ, ϕ using the CNP log-likelihood (Eq. 2) within Algorithm 1, and meta-test it using Algorithm 2. Importantly, the encoder enc_ϕ now includes clipping and adding noise (Algorithm 3) in its forward pass. Meta-training with the functional mechanism in place is crucial, because it teaches the decoder to handle the DP noise and clipping.

Privacy hyperparameters. By Definition 3.2 and Theorem 4.2, each (ϵ, δ)-budget implies a µ-budget, placing a constraint on the sensitivities and noise magnitudes, namely µ² = Δ²_s/σ²_s + Δ²_d/σ²_d. Since ψ is an RBF, Δ²_d = 2 and Δ²_s = 4C², and we need to specify C, σ_s and σ_d, subject to this constraint. We introduce a variable 0 < t < 1 and rewrite the constraint as

σ²_s = 4C² / (t µ²)  and  σ²_d = 2 / ((1 − t) µ²),   (12)

allowing us to freely set t and C. One straightforward approach is to fix t and C to hand-picked values, but this is sub-optimal since the optimal values depend on µ, N, and the data statistics. Instead, we can make them adaptive, letting t : R₊ × N → (0, 1) and C : R₊ × N → R₊ be learnable functions, e.g. neural networks t(µ, N) = sig(NN_t(µ, N)) and C(µ, N) = exp(NN_C(µ, N)), where sig is the sigmoid. These networks are meta-trained along with all other parameters of the DPConv CNP.
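A minimal sketch of the noise_scales step used in Algorithm 3, assuming an RBF kernel with C_k = 1 and the split of Eq. (12); the function names are ours and the hyperparameter values are illustrative rather than taken from the paper's experiments.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def mu_from_eps_delta(eps, delta):
    # Invert the GDP-to-DP conversion of Definition 3.2.
    f = lambda mu: norm.cdf(-eps / mu + mu / 2) - np.exp(eps) * norm.cdf(-eps / mu - mu / 2) - delta
    return brentq(f, 1e-6, 100.0)

def noise_scales(eps, delta, clip_c, t=0.5):
    """Noise standard deviations (sigma_d, sigma_s) for the density and signal channels.

    Uses the RBF sensitivities Delta_d^2 = 2 and Delta_s^2 = 4 C^2 together with the
    split of the mu-budget in Eq. (12): Delta_s^2 / sigma_s^2 = t mu^2 and
    Delta_d^2 / sigma_d^2 = (1 - t) mu^2.
    """
    mu = mu_from_eps_delta(eps, delta)
    delta_d_sq, delta_s_sq = 2.0, 4.0 * clip_c ** 2
    sigma_s = np.sqrt(delta_s_sq / (t * mu ** 2))
    sigma_d = np.sqrt(delta_d_sq / ((1.0 - t) * mu ** 2))
    return sigma_d, sigma_s

sigma_d, sigma_s = noise_scales(eps=1.0, delta=1e-3, clip_c=2.0, t=0.5)
print(f"sigma_d = {sigma_d:.2f}, sigma_s = {sigma_s:.2f}")
```

In the adaptive variant described above, clip_c and t would be replaced by the outputs of the learnable networks C(µ, N) and t(µ, N).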
5 Experiments & Discussion

We conduct experiments on synthetic tasks and on a sim-to-real task with real data. We provide the exact experimental details in Appendix E. We make our implementation of the DPConv CNP public in the repository https://github.com/cambridge-mlg/dpconvcnp.

DP-SGD baseline. Since we are interested in the small-data regime, i.e. a few hundred datapoints per task, we turn to Gaussian processes [GP; Rasmussen and Williams, 2006], the gold-standard model for well-calibrated predictions in this setting. To enforce DP, we make the GP variational [Titsias, 2009], and use DP-SGD [Abadi et al., 2016] to optimise its variational parameters and hyperparameters. This is a strong baseline because GPs excel in the small-data regime, and DP-SGD is a state-of-the-art DP fine-tuning algorithm. We found it critical to carefully tune the DP-SGD parameters and the GP initialisation using Bayes Opt, and devoted substantial compute to this to ensure we have maximised GP performance. We refer to this baseline as the DP-SVGP. For details see Appendix D.

Figure 5: Deployment-time comparison on Gaussian (top) and non-Gaussian (bottom) data. We ran the DP-SVGP for different numbers of DP-SGD steps to determine a speed versus quality-of-fit tradeoff. Reporting 95% confidence intervals.

General setup. In both synthetic and sim-to-real experiments, we first tuned the DP as well as the GP initialisation parameters of the DP-SVGP on synthetic data using Bayes Opt. We then trained the DPConv CNP on synthetic data from the same generative process. Last, we tested both models on unseen test data. For the DP-SVGP, testing involves DP fine-tuning its variational parameters and its hyperparameters on each test set. For the DPConv CNP, testing involves a single forward pass through the network. We report results in Figures 6 and 7, and discuss them below.

5.1 Synthetic tasks

Gaussian data. First, we generated data from a GP with an exponentiated quadratic (EQ) covariance (Figure 6; top), fixing its signal and noise scales, as well as its lengthscale ℓ. For each ℓ we sampled datasets with N ∼ U[1, 512] and privacy budgets with ϵ ∼ U[0.90, 4.00] and δ = 10⁻³. We trained separate DP-SVGPs and DPConv CNPs for each ℓ and tested them on unseen data from the same generative process (non-amortised; Figure 6). These models can handle different privacy budgets but only work well for the lengthscale they were trained on. In practice an appropriate lengthscale is not known a priori. To make this task more realistic, we also trained a single DPConv CNP on data with randomly sampled ℓ ∼ U[0.25, 2.00] (amortised; Figure 6). This model implicitly infers ℓ and simultaneously makes predictions, under DP. We also show the performance of the non-DP Bayes posterior, which is optimal (oracle; Figure 6 top). See Appendix E.1 for more details.

Figure 6: Negative log-likelihoods (NLL) of the DPConv CNP and the DP-SVGP baseline on synthetic data from an EQ GP (top two rows; EQ lengthscale ℓ) and non-Gaussian data from sawtooth waveforms (bottom two rows; waveform period τ). Panels correspond to combinations of ϵ and ℓ (or τ), and lines show the oracle, the DP-SVGP (non-amortised), the DPConv CNP (non-amortised) and the DPConv CNP (amortised). For each point shown we report the mean NLL with its 95% confidence intervals (error bars too small to see). See Appendix C.2 for example fits.

DPConv CNP competes with DP-SVGP. Even in the Gaussian setting, where the DP-SVGP is given the covariance of the generative process, the DPConv CNP remains competitive (red and purple in Figure 6; top). While the DP-SVGP outperforms the DPConv CNP for some N and ℓ, the gaps are typically small.
In contrast, the DP-SVGP often fails to provide sensible predictions (see ℓ = 0.25, N ≥ 300), and tends to overestimate the lengthscale, which is a known challenge in variational GPs [Bauer et al., 2016]. We also found that the DP-SVGP tends to underestimate the observation noise, resulting in over-smoothed and over-confident predictions, which lead to a counter-intuitive reduction in performance as N increases. By contrast, the DPConv CNP gracefully handles different N and recovers predictions that are close to the non-DP Bayesian posterior for modest ϵ and N, with runtimes several orders of magnitude faster than the DP-SVGP (Figure 5).

Amortising over ℓ and privacy budgets. We observe that the DPConv CNP trained on a range of lengthscales (green; Figure 6) accurately infers the lengthscale of the test data, with only a modest performance reduction compared to its non-amortised counterpart (red). The ability of the DPConv CNP to implicitly infer ℓ while making calibrated predictions is remarkable, given the DP constraints under which it operates. Further, we observe that the DPConv CNP works well across a range of privacy budgets. In preliminary experiments, we found that the performance loss due to amortising over privacy budgets is small. This is particularly appealing because a single DPConv CNP can be trained on a range of budgets and deployed at test time using the privacy level specified by the practitioner, eliminating the need for separate models for different budgets.

Non-Gaussian synthetic tasks. We generated data from a non-Gaussian process with sawtooth signals, which has previously been identified as a challenging task [Bruinsma et al., 2023]. We sampled the waveform direction and phase, using a fixed period τ and adding Gaussian observation noise with a fixed magnitude. We gave the DP-SVGP an advantage by using a periodic covariance function, and by truncating the Fourier series of the waveform signal to make it continuous: otherwise, since the DP-SVGP cannot handle discontinuities in the sawtooth signal, it explains the data mostly as noise, failing catastrophically. Again, we trained a separate DP-SVGP and DPConv CNP for each τ, as well as a single DPConv CNP model on randomly sampled τ⁻¹ ∼ U[0.20, 1.25]. We report results in Figure 6 (bottom), along with a non-DP oracle (blue). The Bayes posterior is intractable, so we report the average NLL of the observation noise, which is a lower bound to the NLL.

DPConv CNP outperforms the DP-SVGP. We find that, even though we gave the DP-SVGP significant advantages, the DPConv CNP still outperforms it, and produces near-optimal predictions even for modest N and ϵ. Overall, our findings in the non-Gaussian tasks mirror those of the Gaussian tasks. The DPConv CNP can amortise over different signal periods with very small performance drops (red, green in Figure 6; bottom). Given the difficulty of this task, the fact that the DPConv CNP can predict accurately for signals with different periods under DP constraints is especially impressive.

Figure 7: Left; Negative log-likelihoods of the DPConv CNP and the DP-SVGP baseline on the sim-to-real task with the !Kung dataset, predicting individuals' height from their age (left col.)
or their weight from their age (right col.). For each point shown here, we partition each dataset into a context and target at random, make predictions, and repeat this procedure 512 times. We report mean NLL with its 95% confidence intervals. Error bars are too small to see here. Right; Example predictions for the DPConv CNP and the DP-SVGP, showing the mean and 95% confidence intervals, with N = 300, ϵ = 1.00, δ = 10⁻³. The DPConv CNP is visibly better-calibrated than the DP-SVGP.

5.2 Sim-to-real tasks

Sim-to-real task. We evaluated the performance of the DPConv CNP in a sim-to-real task, where we train the model on simulated data and test it on the Dobe !Kung dataset [Howell, 2009], also used by Smith et al. [2018], containing age, weight and height measurements of 544 individuals. We generated data from GPs with a Matérn-3/2 covariance, with a fixed signal scale of σ_v = 1.00, randomly sampled noise scale σ_n ∼ U[0.20, 0.60] and lengthscale ℓ ∼ U[0.50, 2.00]. We chose Matérn-3/2 since its paths are rougher than those of the EQ, and picked hyperparameter ranges via a back-of-the-envelope calculation, without tuning them for the task. We trained a single DP-SVGP and a DPConv CNP with ϵ ∼ U[0.90, 4.00] and δ = 10⁻³. We consider two test tasks: predicting the height or the weight of an individual from their age. For each N, we split the dataset into a context and target at random, repeating the procedure for multiple splits.

Sim-to-real comparison. While the two models perform similarly for large N, the DPConv CNP performs much better for smaller N (Figure 7; left). The DPConv CNP predictions are surprisingly good even for strong privacy guarantees, e.g. ϵ = 1.00, δ = 10⁻³, and a modest dataset size (Figure 7; right), and significantly better-calibrated than those of the DP-SVGP, which under-fits. Note that we have not tried to tune the simulator or add prior knowledge, which could further improve performance.

6 Limitations & Conclusion

Limitations. The DPConv CNP does not model dependencies between target outputs, which is a major limitation. Modelling such dependencies could be achieved straightforwardly by extending our approach to LNPs, GNPs, or ARNPs. Another limitation is that the efficacy of any sim-to-real scheme is limited by the quality of the simulated data. If the real and the simulated data differ substantially, then sim-to-real transfer has little hope of working. This can be mitigated by simulating diverse datasets to ensure the real data are in the training distribution. However, as simulator diversity increases, predictions typically become less certain, so there is a sweet spot in simulator diversity. While we observed strong sim-to-real results, exploring the effect of this diversity is a valuable direction for future work.

Broader Impacts. This paper presents work whose goal is to advance the field of DP. We view the potential for broader impact of this work as generally positive. Ensuring individual user privacy is critical across a host of machine learning applications. We believe that methods such as ours, aimed at improving the performance of DP algorithms and improving their practicality, have the potential to have a positive impact on individual users of machine learning models.

Conclusion. We proposed an approach for DP meta-learning using NPs. We leveraged and improved upon the functional DP mechanism of Hall et al. [2013], and showed how it can be naturally built into the Conv CNP to protect the privacy of the meta-test set with DP guarantees.
Our improved bounds for the functional DP mechanism are substantial, providing the same privacy guarantees with a 30% lower noise magnitude, and are likely of independent interest. We showed that the DPConv CNP is competitive and often outperforms a carefully tuned DP-SVGP baseline on both Gaussian and non-Gaussian synthetic tasks, while simultaneously being orders of magnitude faster at meta-test time. Lastly, we demonstrated how the DPConv CNP can be used as a sim-to-real model in a realistic evaluation scenario in the small data regime, where it outperforms the DP-SVGP baseline. Acknowledgements This work was supported in part by the Research Council of Finland (Flagship programme: Finnish Center for Artificial Intelligence, FCAI as well as Grants 356499 and 359111), the Strategic Research Council at the Research Council of Finland (Grant 358247) as well as the European Union (Project 101070617). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them. SM is supported by the Vice Chancellor s and Marie and George Vergottis Scholarship, and the Qualcomm Innovation Fellowship. Richard E. Turner is supported by Google, Amazon, ARM, Improbable, EPSRC grant EP/T005386/1, and the EPSRC Probabilistic AI Hub (Prob AI, EP/Y028783/1). Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan Mc Mahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016. Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2623 2631, 2019. Francesco Aldà and Benjamin Rubinstein. The Bernstein Mechanism: Function Release under Differential Privacy. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems, volume 29, 2016. John Bronskill. Data and computation efficient meta-learning. Ph D thesis, University of Cambridge, 2020. Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Velickovic. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. ar Xiv preprint ar Xiv:2104.13478, 2021. Wessel Bruinsma, Stratis Markou, James Requeima, Andrew Y. K. Foong, Tom Andersson, Anna Vaughan, Anthony Buonomo, Scott Hosking, and Richard E Turner. Autoregressive conditional neural processes. In The Eleventh International Conference on Learning Representations, 2023. Taco Cohen and Max Welling. Group equivariant convolutional networks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2990 2999. PMLR, 2016. Soham De, Leonard Berrada, Jamie Hayes, Samuel L. Smith, and Borja Balle. Unlocking High-Accuracy Differentially Private Image Classification through Scale. ar Xiv preprint ar Xiv:2204.13650, 2022. Jinshuo Dong, Aaron Roth, and Weijie J. Su. Gaussian Differential Privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):3 37, 2022. Vincent Dutordoir, Alan Saul, Zoubin Ghahramani, and Fergus Simpson. 
Neural diffusion processes. In International Conference on Machine Learning, pages 8990 9012. PMLR, 2023. Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211 407, 2014. Cynthia Dwork, Frank Mc Sherry, Kobbi Nissim, and Adam D. Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In Third Theory of Cryptography Conference, volume 3876 of Lecture Notes in Computer Science, pages 265 284. Springer, 2006. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning. PMLR, 2017. Andrew Foong, Wessel Bruinsma, Jonathan Gordon, Yann Dubois, James Requeima, and Richard Turner. Meta-learning stationary stochastic process prediction with convolutional neural processes. In Advances in Neural Information Processing Systems, volume 33, pages 8284 8295, 2020. Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and S. M. Ali Eslami. Conditional Neural Processes. In Proceedings of the 35th International Conference on Machine Learning, pages 1704 1713. PMLR, 2018a. Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. Neural Processes. ar Xiv preprint ar Xiv:1807.01622, 2018b. Jonathan Gordon, Wessel P. Bruinsma, Andrew Y. K. Foong, James Requeima, Yann Dubois, and Richard E. Turner. Convolutional Conditional Neural Processes. In International Conference on Learning Representations, 2020. Rob Hall, Alessandro Rinaldo, and Larry A. Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14(1):703 727, 2013. Nancy Howell. Dobe !kung census of all population., 2009. URL https://tspace.library. utoronto.ca/handle/1807/17973. License: All right reserved. Daolang Huang, Manuel Haussmann, Ulpu Remes, S. T. John, Grégoire Clarté, Kevin Luck, Samuel Kaski, and Luigi Acerbi. Practical Equivariances via Relational Conditional Neural Processes. In Advances in Neural Information Processing Systems, volume 36, 2023. Dihong Jiang, Sun Sun, and Yaoliang Yu. Functional Rényi Differential Privacy for Generative Modeling. In Thirty-Seventh Conference on Neural Information Processing Systems, 2023. Jeffrey Li, Mikhail Khodak, Sebastian Caldas, and Ameet Talwalkar. Differentially Private Meta Learning. In International Conference on Learning Representations, 2020. Xuechen Li, Florian Tramèr, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. In The Tenth International Conference on Learning Representations, ICLR, 2022. Stratis Markou, James Requeima, Wessel P. Bruinsma, Anna Vaughan, and Richard E. Turner. Practical conditional neural processes via tractable dependent predictions. In Proceedings of the 10th International Conference on Learning Representations, 2022. Ilya Mironov. Rényi Differential Privacy. In 30th IEEE Computer Security Foundations Symposium, pages 263 275, 2017. Ardalan Mirshani, Matthew Reimherr, and Aleksandra Slavkovi c. Formal Privacy for Functional Data with Gaussian Perturbations. In Proceedings of the 36th International Conference on Machine Learning, pages 4595 4604. PMLR, May 2019. Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. ar Xiv preprint ar Xiv:1803.02999, 2018. 
Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. Michael T. Smith, Mauricio A. Álvarez, Max Zwiessele, and Neil D. Lawrence. Differentially Private Regression with Gaussian Processes. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, pages 1195 1203. PMLR, 2018. Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems, volume 30, 2017. Xinyu Tang, Ashwinee Panda, Vikash Sehwag, and Prateek Mittal. Differentially Private Image Classification by Learning Priors from Random Processes. In Thirty-Seventh Conference on Neural Information Processing Systems, 2023. Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial intelligence and statistics, pages 567 574. PMLR, 2009. Marlon Tobaben, Aliaksandra Shysheya, John Bronskill, Andrew Paverd, Shruti Tople, Santiago Zanella Béguelin, Richard E. Turner, and Antti Honkela. On the efficacy of differentially private few-shot image classification. Transactions on Machine Learning Research, 2023. Florian Tramèr, Gautam Kamath, and Nicholas Carlini. Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining. In Proceedings of the 41st International Conference on Machine Learning, pages 48453 48467. PMLR, 2024. Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016. Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, Graham Cormode, and Ilya Mironov. Opacus: User-friendly differential privacy library in Py Torch. ar Xiv preprint ar Xiv:2109.12298, 2021. Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, and Huishuai Zhang. Differentially Private Fine-tuning of Language Models. In International Conference on Learning Representations, 2021. Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov, and Alexander J Smola. Deep sets. In Advances in Neural Information Processing Systems, volume 30, 2017. Xinyu Zhou and Raef Bassily. Task-level Differentially Private Meta Learning. In Advances in Neural Information Processing Systems, volume 35, 2022. A Differential Privacy Details A.1 Measure-Theoretic Details Definition 3.1 is the typical definition of (ϵ, δ)-DP that is given in the literature, but it glosses over some measure-theoretic details that are usually not important, but are important for the functional mechanism. In particular, the precise meaning of measurable is left open. Here, we make the σ-field that measurable implicitly refers to explicit: Definition A.1. An algorithm M is (ϵ, δ, A)-DP for a σ-field A if, for neighbouring datasets D, D and all A A, Pr(M(D) A) eϵ Pr(M(D ) A) + δ. (13) Hall et al. [2013] point out that the choice of A is important, and insist that A be the finest σ-field on which M(D) is defined for all D. When the output of the mechanism is discrete, or an element of Rn, this corresponds with the σ-field that is typically implicitly used in such settings. When the output is a function, as in the functional mechanism, the choice of A is not as clear [Hall et al., 2013]. 
Note that A is similarly implicitly present in the definition of GDP (Definition 3.2). Next, we recall the construction of the appropriate σ-field for the functional mechanism from Hall et al. [2013]. Let T be an index set. We denote the set of functions from T to R as RT . For S = (x1, . . . , xn) T n and a Borel set B B(Rn), CS,B = {f RT | (f(x1), . . . , f(xn)) B} (14) is called a cylinder set of functions. Let CS = {CS,B | B B(Rn)} and S:|S|< CS. (15) F0 is called the field of cylinder sets. (ϵ, δ, F0)-DP 2 amounts to (ϵ, δ, B(Rn))-DP for any evaluation of f at a finite vector of points (x1, . . . , xn) T n, of any size n N [Hall et al., 2013]. The σ-field for the functional mechanism is the σ-field F generated by F0 [Hall et al., 2013]. It turns out that (ϵ, δ, F0)-DP is sufficient for (ϵ, δ, F)-DP. A.2 General Definition of the Functional Mechanism Definition A.2. Let g be a sample path of a Gaussian process having mean zero and covariance function k. Let {f D : D D} RT be a family of functions indexed by datasets satisfying the inequality sup D D sup n< sup (x1,...,xn) T n (x1,...,xn) D,D 2 , (16) (x1,...,xn) D,D = M 1/2(x1, . . . , xn) f D(x1) f D (x1) ... f D(xn) f D (xn) where M(x1, . . . xn)ij = k(xi, xj). The functional mechanism with multiplier c and sensitivity is defined as M(D) = f D + cg. (17) If f is a member of a reproducible kernel Hilbert space (RKHS) H with the same kernel k as the noise process g, the sensitivity bound of Definition A.2 is much simpler: Lemma A.3 (Hall et al. 2013). For a function f in an RKHS H with kernel k, Hf def = sup D D ||f D f D ||H . (18) implies (16). 2This is a small abuse of notation, as F0 is not a σ-field. A.3 Proof of Theorem 4.1 To prove Theorem 4.1, we need a GDP version of a lemma from Hall et al. [2013]: Lemma A.4. Suppose that, for a positive definite symmetric matrix M Rd d, the function f : D Rd satisfies sup D D ||M 1/2(f(D) f(D ))||2 . (19) Then the mechanism M that outputs (Gaussian mechanism) M(D) = f(D) + c Z, Z Nd(0, M) is µ-GDP with c = Proof. We can write M(D) = M 1/2c M 1/2 c f(D) + S , S Nd(0, I). Denote M (D) = M 1/2 c f(D) + S. M is a Gaussian mechanism with variance 1. Because of (19), c f(D) has sensitivity c f(D) M 1/2 so M is µ-GDP by Theorem 3.5. M is obtained by post-processing M , so it is also µ-GDP. Theorem 4.1. The functional mechanism with sensitivity and multiplier c = /µ is µ-GDP. Proof. Let T be the index set of the Gaussian process G, and let S = (x1, . . . , xn) T n. Then (G(x1), . . . , G(xn)) has a multivariate Gaussian distribution with mean zero and covariance Cov(G(xi), G(xj)) = K(xi, xj). Then the vector obtained by evaluating M(D) at (x1, . . . , xn) is µ-GDP by Lemma A.4, as (16) implies the sensitivity bound (19). Theorem 3.5 gives a curve of (ϵ, δ(ϵ))-bounds for all ϵ 0 from µ. This holds for any S T n and any n N, so M is (ϵ, δ(ϵ), F0)-DP for all ϵ 0, which immediately implies (ϵ, δ(ϵ), F)-DP. This curve can be converted back to µ-GDP (with regards to F) using Theorem 3.5. A.4 Functional Mechanism Sensitivities for DPConv CNP To bound the sensitivity of r(d) and r(s) for the functional mechanism, we look at two neighbouring context sets D(c) 1 = ((x(c) n,1, y(c) n,1))N n=1 and D(c) 2 = ((x(c) n,2, y(c) n,2))N n=1 that differ only in the points (x1, y1) D(c) 1 and (x2, y2) D(c) 2 . Let r(d) D(c) i for i {1, 2} denote r(d) from (10) computed from D(c) i , and define r(s) D(c) i similarly. Denote the RKHS of the kernel k by H. 
The distance in H between the functions kx1 = k(x1, ) and kx2 = k(x2, ) is given by ||kx1 kx2||2 H = kx1 kx2, kx1 kx2 H (20) = k(x1, x1) 2k(x1, x2) + k(x2, x2) (21) 2Ck. (22) For the RBF kernel, this is a tight bound without other assumptions on x, as k(x, x) = 1 = Ck for all x and k(x1, x2) can be made arbitrarily small by placing x1 and x2 far away from each other. The sensitivity of r(d) for the functional mechanism can be bounded with (22): for , 2 Hr(d) = sup D(c) 1 SD(c) 2 D(c) 1 r(d) D(c) 1 SD(c) 2 n=1 (kx(c) n,1 kx(c) n,2) = sup x1,x2 ||kx1 kx2||2 H (25) where the second to last line follows from the fact that D(c) 1 and D(c) 2 only differ in one datapoint. This is a tight bound for the RBF kernel, because when x = x1, kx1(x) = 1 and kx2(x) = 1 can be a made arbitrarily small by moving x2 far away from x. For r(s) and any function ϕ with |ϕ(y)| C, we first bound ||ϕ(y1)kx1 ϕ(y2)kx2||2 H (27) = ϕ(y1)2k(x1, x1) 2ϕ(y1)ϕ(y2)k(x1, x2) + ϕ(y2)2k(x2, x2) (28) 4C2Ck. (29) Again, these are tight bounds for the RBF kernel if we don t constrain x or y further. The H-sensitivity for r(s) is then derived in the same way as the sensitivity for r(d), giving 2 Hr(s) 4C2Ck. (30) A.5 Gaussian Mechanism for DPConv CNP A naive way of releasing r(x) under DP is to first select discretisation points x1, . . . , xn, in some way, and release r(x1), . . . , r(xn) with the Gaussian mechanism. The components of r, r(s) and r(d), have the following sensitivities: 2r(d)(x) = sup D(c) 1 SD(c) 2 D(c) 1 (x) r(d) D(c) 1 SD(c) 2 n=1 (kx(c) n,1(x) kx(c) n,2(x)) = sup x1,x2 |kx1(x) kx2(x)|2 (33) Line (33) follows from the fact that D(c) 1 and D(c) 2 only differ in one datapoint. For r(s)(x), we have |ϕ(y1)kx1(x) ϕ(y2)kx2(x)|2 4C2C2 k. (35) Then we get 2r(s)(x) 4C2C2 k (36) following the derivation of 2r(d)(x). For the RBF kernel, this is again tight without additional assumptions on y or x. These sensitivities give the following privacy bound: Theorem A.5. Let 2 s = 4C2C2 k and 2 d = C2 k. Releasing n evaluations of r(x) = (r(d)(x), r(s)(x)) with the Gaussian mechanism with noise variance σ2 is µ-GDP for n 2s + 2 d σ2 . (37) Proof. Releasing n evaluations of r(x) is simply an n-fold composition of Gaussian mechanisms that release r(x) for one value. Releasing r(x) for one value is a composition of releasing r(d)(x) and r(s)(x), which have the sensitivities s and d. Now Theorems 3.5 and 3.4 prove the claim. As µ scales with n, this method must add a large amount of noise for even moderate numbers of discretisation points. The difference between having a factor of C2 k in the L2-sensitivities and Ck in the H-sensitivities is explained by the fact that the kernel also directly affects the noise variance for the functional mechanism, but it does not directly affect the noise variance with the Gaussian mechanism. This can be illustrated by considering what happens when the kernel is multiplied by a constant u > 0. This multiplies Ck by u, and multiplies the L2-sensitivities by u2. For the Gaussian mechanism, this means multiplying the noise standard deviation by u, but simultaneously multiplying all released values by u, which does not change the signal-to-noise ratio. For the functional mechanism, multiplying the kernel values effectively multiplies c by u and the squared sensitivities by u, which then cancel each other in µ. 
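The trade-off discussed in this subsection can be illustrated with a small sketch. It assumes the reading of Theorem A.5 as µ = √(n(Δ²_s + Δ²_d))/σ for the naive pointwise Gaussian mechanism, and contrasts the resulting noise level, which grows with the number of discretisation points n, with the grid-size-independent functional mechanism noise from Theorem 4.2 and Eq. (12); the code and names are ours.

```python
import numpy as np

def naive_gaussian_sigma(n_grid, mu, clip_c, c_k=1.0):
    # Pointwise Gaussian mechanism (Theorem A.5, as read above): to be mu-GDP when
    # releasing n grid evaluations of (r_d, r_s), the noise std must be
    # sigma = sqrt(n * (Delta_s^2 + Delta_d^2)) / mu, with the pointwise
    # L2-sensitivities Delta_d^2 = C_k^2 and Delta_s^2 = 4 C^2 C_k^2.
    delta_sq = c_k ** 2 + 4.0 * clip_c ** 2 * c_k ** 2
    return np.sqrt(n_grid * delta_sq) / mu

def functional_sigma(mu, clip_c, c_k=1.0, t=0.5):
    # Functional mechanism (Theorem 4.2 with the split of Eq. 12): the noise level
    # depends only on the RKHS sensitivities Delta_d^2 = 2 C_k and Delta_s^2 = 4 C^2 C_k,
    # not on the number of grid points.
    sigma_s = np.sqrt(4.0 * clip_c ** 2 * c_k / (t * mu ** 2))
    sigma_d = np.sqrt(2.0 * c_k / ((1.0 - t) * mu ** 2))
    return sigma_d, sigma_s

mu, clip_c = 1.0, 2.0
for n in (2, 64, 512):
    sd, ss = functional_sigma(mu, clip_c)
    print(f"n = {n:4d}: naive sigma = {naive_gaussian_sigma(n, mu, clip_c):7.2f}, "
          f"functional (sigma_d, sigma_s) = ({sd:.2f}, {ss:.2f})")
```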
For the RBF kernel and clipping function ϕ with threshold C = 1, we see that 2 Hr(d) = 2 while 2 2r(d) = 1, and 2 Hr(s) = 4, while 2 2 2r(s) = 4, so the functional mechanism adds noise with slightly more variance as releasing a single value with the Gaussian mechanism, so the functional mechanism adds noise of less variance when 2 or more discretisation points are required. However, the functional mechanism adds correlated noise, which is not as useful for denoising as the uncorrelated noise that the Gaussian mechanism adds. A.6 Details on Figure 4 In this section, we go over the details of the calculations behind Figure 4. The classical line of the figure is computed from Theorem 3.7. The GDP line uses Definition 3.2 to convert the (ϵ, δ)-pair into a GDP µ bound by numerically solving for µ in Eq.(5). σ is then found with Theorem 4.1. For the RDP line, we get an RDP guarantee from Corollary 2 of Jiang et al. [2023], which we convert to (ϵ, δ) with Proposition 3 of Jiang et al. [2023]. These give the equation where α > 1 is a parameter of RDP that can be freely chosen. The α value that minimises ϵ is 2 + 1. (39) Plugging α into Eq. (38) gives the quadratic equation that can be solved for σ. To see that choosing the α that minimises ϵ also leads to the smallest σ that satisfies a given (ϵ, δ)- bound, let ϵ(α, σ) = α 2 and let σ be the solution to Eq. (40). Let α, σ be another pair that satisfies the (ϵ, δ)-bound. Since α is chosen to minimise ϵ, ϵ(α (σ), σ) ϵ(α, σ). We can assume that ϵ(α, σ) = ϵ(α (σ), σ) = ϵ, since otherwise we could reduce σ further. Now ϵ(α (σ ), σ ) = ϵ = ϵ(α (σ), σ) (42) so ϵ(α (σ ), σ ) = ϵ(α (σ), σ). By manipulating Eq. (40), we can see that ϵ(α ( ), ) is strictly decreasing, so this implies that σ = σ . Algorithm 4 Efficient sampling of GP noise on a D-dimensional grid. Input : Dimension-wise grid location ud R, spacing γd R and number of points Nd N, Product kernel k : RD RD R with factors kd : R R R. Output: Sample fn1...n D from GP with kernel k on grid inputs xn1...n D defined by ud, γd, Nd. Sample fn1...n D N(0, 1) for each 1 nd Nd, d = 1, . . . , D {Sample Gaussian noise} for d = 1 to D do Kdnm kd(ud + nγd, ud + mγd) for 0 n, m Nd 1 {Compute covariance} Ld CHOLESKY(Kd) {Compute Cholesky factor} f MATMULALONGDIM(Ld, f, d) {Matmul f by Ld along dimension d} end for B Efficient sampling of Gaussian process noise In order to ensure differential privacy within the DPConv CNP, we need to add GP noise (from a GP with an EQ kernel) to the functional representation outputted by the Set Conv. In practice, this is implemented by adding GP noise on the discretised representation, i.e. the data and density channels. While sampling GP noise is typically tractable if the grid is one-dimensional, the computational and memory costs of sampling can easily become intractable for twoor three-dimensional grids. This is because the number of grid points increases exponentially with the number of input dimensions and, in addition to this, the cost of sampling increases cubically with the number of grid points. Fortunately, we can exploit the regularity of the grid and the fact that the EQ kernel is a product kernel, to make sampling tractable. Proposition B.1 illustrates how this can be achieved. Proposition B.1. Let x RN1 ND be a grid of points in RD given by xn1... n D = (u1 + (n1 1)γ1, . . . , u D + (n D 1)γD) , (43) where ud R, γd R+ and 1 nd N Nd for each d = 1, . . . , D. Also let k : RD RD R be a product kernel, i.e. 
Algorithm 4: Efficient sampling of GP noise on a D-dimensional grid.
Input: dimension-wise grid locations $u_d \in \mathbb{R}$, spacings $\gamma_d \in \mathbb{R}$ and numbers of points $N_d \in \mathbb{N}$; product kernel $k : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ with factors $k_d : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$.
Output: sample $f_{n_1 \dots n_D}$ from the GP with kernel $k$ on the grid inputs $x_{n_1 \dots n_D}$ defined by $u_d, \gamma_d, N_d$.
  Sample $f_{n_1 \dots n_D} \sim N(0, 1)$ for each $1 \leq n_d \leq N_d$, $d = 1, \dots, D$   {Sample Gaussian noise}
  for $d = 1$ to $D$ do
    $K_{d,nm} \leftarrow k_d(u_d + n\gamma_d, u_d + m\gamma_d)$ for $0 \leq n, m \leq N_d - 1$   {Compute covariance}
    $L_d \leftarrow$ CHOLESKY($K_d$)   {Compute Cholesky factor}
    $f \leftarrow$ MATVECALONGDIM($L_d$, $f$, $d$)   {Multiply $f$ by $L_d$ along dimension $d$}
  end for

B Efficient sampling of Gaussian process noise

In order to ensure differential privacy within the DPConv CNP, we need to add GP noise (from a GP with an EQ kernel) to the functional representation output by the Set Conv. In practice, this is implemented by adding GP noise to the discretised representation, i.e. the data and density channels. While sampling GP noise is typically tractable if the grid is one-dimensional, the computational and memory costs of sampling can easily become intractable for two- or three-dimensional grids. This is because the number of grid points increases exponentially with the number of input dimensions and, in addition to this, the cost of sampling increases cubically with the number of grid points. Fortunately, we can exploit the regularity of the grid and the fact that the EQ kernel is a product kernel to make sampling tractable. Proposition B.1 illustrates how this can be achieved.

Proposition B.1. Let $x \in \mathbb{R}^{N_1 \times \dots \times N_D}$ be a grid of points in $\mathbb{R}^D$ given by
$$x_{n_1 \dots n_D} = (u_1 + (n_1 - 1)\gamma_1, \dots, u_D + (n_D - 1)\gamma_D), \quad (43)$$
where $u_d \in \mathbb{R}$, $\gamma_d \in \mathbb{R}_+$ and $1 \leq n_d \leq N_d$ for each $d = 1, \dots, D$. Also let $k : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ be a product kernel, i.e. a kernel that satisfies
$$k(z, z') = \prod_{d=1}^{D} k_d(z_d, z'_d) \quad (44)$$
for some univariate kernels $k_d : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, for every $z, z' \in \mathbb{R}^D$. Let
$$K_{d,nm} = k_d(u_d + (n - 1)\gamma_d,\, u_d + (m - 1)\gamma_d), \quad (45)$$
and let $L_d$ be a Cholesky factor of the matrix $K_d$. Then, if $\epsilon_{n_1 \dots n_D} \sim N(0, 1)$ is a grid of corresponding i.i.d. standard Gaussian noise, the scalars $f_{n_1 \dots n_D} \in \mathbb{R}$, defined as
$$f_{n_1 \dots n_D} = \sum_{k_1=1}^{N_1} \dots \sum_{k_D=1}^{N_D} L_{1, n_1 k_1} \cdots L_{D, n_D k_D}\, \epsilon_{k_1 \dots k_D}, \quad (46)$$
are Gaussian-distributed, with zero mean and covariance
$$\mathbb{C}[f_{n_1 \dots n_D}, f_{m_1 \dots m_D}] = k(x_{n_1 \dots n_D}, x_{m_1 \dots m_D}). \quad (47)$$

Before giving the proof of Proposition B.1, we provide pseudocode for this approach in Algorithm 4 and discuss the computation and memory costs of this implementation compared to a naive approach. Naive sampling on a grid of $N_1 \times \dots \times N_D$ points requires computing a Cholesky factor for the covariance matrix of the entire grid and then multiplying standard Gaussian noise by this factor. We discuss the costs of these operations, comparing them to the efficient approach.

Computing Cholesky factors. Computing a Cholesky factor for the covariance matrix of the entire $N_1 \times \dots \times N_D$ grid incurs a computational cost of $O(N_1^3 \cdots N_D^3)$ and a memory cost of $O(N_1^2 \cdots N_D^2)$. By contrast, Algorithm 4 only ever computes Cholesky factors for $N_d \times N_d$ covariance matrices, so it incurs a computational cost of $O\big(\sum_{d=1}^{D} N_d^3\big)$ and a memory cost of $O\big(\sum_{d=1}^{D} N_d^2\big)$, which are both much lower than those of a naive implementation. For clarity, if $N_1 = \dots = N_D = N$, naive factorisation has $O(N^{3D})$ computational and $O(N^{2D})$ memory cost, whereas the efficient implementation has $O(DN^3)$ computational and $O(DN^2)$ memory cost.

Matrix multiplications. In addition, naively multiplying the Gaussian noise by the Cholesky factor of the entire covariance matrix incurs an $O(N_1^2 \cdots N_D^2)$ computational cost. On the other hand, in Algorithm 4 we perform $D$ batched matrix-vector multiplications, where the $d$th multiplication consists of $\prod_{d' \neq d} N_{d'}$ matrix-vector multiplications, in each of which a vector with $N_d$ entries is multiplied by an $N_d \times N_d$ matrix. The total computational cost for this step is only $O\big(\sum_{d=1}^{D} N_d^2 \prod_{d' \neq d} N_{d'}\big)$. For example, if $N_1 = \dots = N_D = N$, naive matrix-vector multiplication has a computational cost of $O(N^{2D})$, whereas the efficient implementation has a computational cost of $O(N^{D+1})$.

In Algorithm 4, CHOLESKY denotes a function that computes the Cholesky factor of a square positive-definite matrix. MATVECALONGDIM($L_d$, $f$, $d$) denotes the batched matrix-vector multiplication of an array $f$ by a matrix $L_d$ along dimension $d$, batching over the dimensions $d' \neq d$. Specifically, given a $D$-dimensional array $b$ with dimension sizes $N_1, \dots, N_D$ and an $N_d \times N_d$ matrix $A$, the matrix-vector multiplication of $b$ by $A$ along dimension $d$ outputs the $D$-dimensional array
$$\text{MATVECALONGDIM}(A, b, d)_{n_1 \dots n_D} = \sum_{j=1}^{N_d} A_{n_d j}\, b_{n_1 \dots n_{d-1}\, j\, n_{d+1} \dots n_D}. \quad (48)$$
From the above equation, we can see that initialising $f$ with standard Gaussian noise, and batch-multiplying $f$ by $L_d$ along dimension $d$ for each $d = 1, \dots, D$, amounts to computing the nested sum in Equation (46). Note that the order in which these batch multiplications are performed does not matter: it changes neither the numerical result nor the computation or memory cost of the algorithm.
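The following NumPy sketch (our illustration, not the authors' implementation) realises Algorithm 4 for a product kernel, using np.tensordot to implement MATVECALONGDIM; the kernel callables, the jitter term and all names are our own choices.

```python
import numpy as np

def sample_grid_gp_noise(kernels, us, gammas, Ns, rng=None):
    """Sample GP noise on a D-dimensional grid for a product kernel (cf. Algorithm 4).

    kernels: list of D univariate kernel functions k_d(a, b), applied elementwise.
    us, gammas, Ns: per-dimension grid offsets, spacings and numbers of points.
    Returns an array of shape (N_1, ..., N_D).
    """
    rng = np.random.default_rng() if rng is None else rng
    f = rng.standard_normal(tuple(Ns))  # i.i.d. N(0, 1) noise on the grid
    for d, (k_d, u, gamma, N) in enumerate(zip(kernels, us, gammas, Ns)):
        x = u + gamma * np.arange(N)                   # grid locations along dimension d
        K = k_d(x[:, None], x[None, :])                # N_d x N_d covariance matrix
        L = np.linalg.cholesky(K + 1e-9 * np.eye(N))   # jitter for numerical stability
        # Batched matrix-vector product along dimension d (MATVECALONGDIM).
        f = np.moveaxis(np.tensordot(L, f, axes=([1], [d])), 0, d)
    return f

# Example: EQ (RBF) product kernel on a 2D grid.
eq = lambda ell: (lambda a, b: np.exp(-0.5 * ((a - b) / ell) ** 2))
noise = sample_grid_gp_noise([eq(0.2), eq(0.2)], us=[-2.0, -2.0],
                             gammas=[1 / 32, 1 / 32], Ns=[128, 128])
```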
Proof of Proposition B.1. From the definition above, we see that $f_{n_1 \dots n_D}$ is a linear transformation of Gaussian random variables with zero mean, and therefore also has zero mean. For the covariance, again from the definition above, we have
$$\mathbb{C}[f_{n_1 \dots n_D}, f_{m_1 \dots m_D}] = \mathbb{E}\Big[ \Big( \sum_{k_1=1}^{N_1} \dots \sum_{k_D=1}^{N_D} L_{1, n_1 k_1} \cdots L_{D, n_D k_D}\, \epsilon_{k_1 \dots k_D} \Big) \Big( \sum_{l_1=1}^{N_1} \dots \sum_{l_D=1}^{N_D} L_{1, m_1 l_1} \cdots L_{D, m_D l_D}\, \epsilon_{l_1 \dots l_D} \Big) \Big], \quad (49)$$
where we have used the fact that the expectation of $f$ is zero. Now, expanding the product of sums above, taking the expectation and using the fact that the $\epsilon_{n_1 \dots n_D}$ are i.i.d., we see that all terms vanish, except those where $k_d = l_d$ for all $d = 1, \dots, D$. Specifically, we have
$$\mathbb{C}[f_{n_1 \dots n_D}, f_{m_1 \dots m_D}] = \sum_{k_1, l_1} \dots \sum_{k_D, l_D} L_{1, n_1 k_1} \cdots L_{D, n_D k_D}\, L_{1, m_1 l_1} \cdots L_{D, m_D l_D}\, \mathbb{E}[\epsilon_{k_1 \dots k_D}\, \epsilon_{l_1 \dots l_D}] \quad (55)$$
$$= \sum_{k_1, l_1} \dots \sum_{k_D, l_D} L_{1, n_1 k_1} \cdots L_{D, n_D k_D}\, L_{1, m_1 l_1} \cdots L_{D, m_D l_D}\, \mathbb{1}_{k_1 = l_1, \dots, k_D = l_D} \quad (56)$$
$$= \sum_{k_1=1}^{N_1} \dots \sum_{k_D=1}^{N_D} L_{1, n_1 k_1} \cdots L_{D, n_D k_D}\, L_{1, m_1 k_1} \cdots L_{D, m_D k_D} \quad (57)$$
$$= \Big( \sum_{k_1=1}^{N_1} L_{1, n_1 k_1} L_{1, m_1 k_1} \Big) \cdots \Big( \sum_{k_D=1}^{N_D} L_{D, n_D k_D} L_{D, m_D k_D} \Big) \quad (58)$$
$$= \prod_{d=1}^{D} K_{d, n_d m_d} \quad (59)$$
$$= k(x_{n_1 \dots n_D}, x_{m_1 \dots m_D}), \quad (60)$$
which is the required result.

[Figure S1 panels: NLL against the number of context points N (0 to 500) for lengthscales ℓ = 0.25, 0.71 and 2.00; legend: Oracle; No clip, no density noise; Clip, no density noise; Clip, density noise; Lower bound.]

Figure S1: DPConv CNP performance on the GP modelling task, where the data are generated using an EQ GP with lengthscale ℓ. We train three models per ϵ, ℓ combination, keeping δ = 10⁻³, the clipping threshold C = 2.00 and the noise weight t = 0.50 fixed. Specifically, we train one model where only noise is applied to the signal channel (red; no clip, no density noise), one model where noise and clipping are applied to the signal channel (orange; clip, no density noise), and another model where noise and clipping are applied to the signal channel as well as noise to the density channel (green; clip, density noise). We also show the NLL of the oracle, non-DP, Bayesian posterior, which is the best average NLL that can be obtained on this task (blue). Lastly, we show a bound for the functional mechanism (black), which is a lower bound on the NLL that can be obtained with the functional mechanism with C = 2.00, t = 0.50 on this task. We used 512 evaluation tasks for each N, ℓ, ϵ combination, and report mean NLLs together with their 95% confidence intervals. Note that the error bars are plotted but are too small to see in the plot.

C Additional results

C.1 How effectively does the Conv CNP learn to undo the DP noise?

Quantifying performance gaps. In this section we provide some additional results on the performance of the DPConv CNP and the functional mechanism. Specifically, we investigate the performance gap between the DPConv CNP and the oracle (non-DP) Bayes predictor. Assuming the data generating prior is known, which in our synthetic experiments it is, the corresponding Bayes posterior predictive attains the best average log-likelihood achievable. We determine and quantify the sources of this gap in a controlled setting.

Sources of the performance gaps. Specifically, the performance gap can be broken down into two main parts: one part due to the DP mechanism (specifically the signal channel clipping and noise, and the density channel noise), and another part due to the fact that we are training a neural network to map the DP representation to an estimate of the Bayes posterior. To assess the performance gap introduced by each of these sources, we perform a controlled experiment with synthetic data from a Gaussian process prior (see Figure S1).

Gap quantification experiment. We fix the clipping threshold at C = 2.00, which is a sensible setting since the marginal confidence intervals of the data generating process are ±1.96.
We also fix the noise weighting at t = 0.50, which is again a sensible setting since it places roughly equal importance on the noise added to the density and signal channels. We consider three different settings for the prior lengthscale (ℓ = 0.25, 0.71, 2.00) and two settings for the DP parameters (ϵ = 1.00, 3.00, with fixed δ = 10⁻³). For each of the six combinations of settings, we train three different DPConv CNPs: one with just signal noise (red; no clip, no density noise), one with signal noise and clipping (orange; clip, no density noise), and one with signal noise and clipping and also density noise (green; clip, density noise). Note that only the last model has DP guarantees. We compare performance with the non-DP Bayesian posterior oracle (blue).

Lower bound model. When we only add signal noise to the Conv CNP representation (and do not apply clipping or add density noise), and the true generative process is a GP, as is the case here, the predictive posterior (given the noisy signal representation and the noiseless density representation) is a GP. That is because the data come from a known GP, and the signal channel is a linear combination of the data (since we have turned off clipping) plus GP noise, so it is also a GP. Therefore, we can write down a closed-form predictive posterior in this case. We refer to this as the lower bound model (black) in Figure S1, because for a given C and t, the performance of this model is a lower bound on the NLL of any model that uses this representation as input. Note, however, that different lower bounds can be obtained for different C and t.

Conclusions. We observe that the DPConv CNP with no clipping and no density noise (red) matches the performance of the lower bound model. This is encouraging as it suggests that the model is able to undo the effect of the signal noise perfectly. We also observe that applying clipping (orange) does not reduce performance substantially, which is also encouraging as it suggests that the model is able to cope with the effect of clipping on the signal channel, when it is trained to do so. Lastly, we observe that an additional gap in performance is introduced by the noise in the density channel (green). This is expected since the density noise is substantial and confounds the context inputs. This gap reduces as the number of context points increases, which is again expected. From the above, we conclude that in practice the model is able to make predictions under DP constraints that are near optimal, i.e. there is likely not a significant gap due to approximating, with a neural network, the mapping from the DP representation to the optimal predictive map.

C.2 Supplementary model fits for the synthetic tasks

We also provide supplementary model fits for the synthetic, Gaussian and non-Gaussian tasks. For each task, we provide fits for three settings of the generative parameter (ℓ or τ), two privacy budgets, four context sizes and two dataset random seeds. See Figures S2 to S5, at the end of this document, for model fits.

D Differentially-Private Sparse Gaussian Process Baseline

Here, we provide details of the differentially-private sparse variational Gaussian process (DP-SVGP) baseline. Let D = (x, y) denote a dataset consisting of N inputs x ∈ X^N and N corresponding outputs y ∈ Y^N.
We assume the observations are generated according to the probabilistic model
$$f \sim \mathcal{GP}(0, k_{\theta_1}(x, x')), \qquad y \mid f, x \sim \prod_{n=1}^{N} p_{\theta_2}(y_n \mid f(x_n)), \quad (61)$$
where $k_{\theta_1}$ denotes the GP kernel from which the latent function $f$ is sampled, with hyperparameters $\theta_1$, and $\theta_2$ denotes the parameters of the likelihood function. Computing the posterior distribution $p_\theta(f \mid x, y)$ is only feasible in closed form when the likelihood is Gaussian. Even when this is true, the computational complexity associated with this computation is $O(N^3)$. Sparse variational GPs [Titsias, 2009] offer a solution to this by approximating the true posterior with the GP
$$q_{\theta, \phi}(f) = p_\theta(f_{\neq u} \mid u)\, q_\phi(u), \quad (62)$$
with $u = f(z)$, where $z \in X^M$ denotes a set of $M$ inducing locations, and $q_\phi(u) = N(u; m, S)$. The computational complexity associated with this posterior approximation is $O(NM^2)$, which is significantly lower if $M \ll N$. We can optimise the variational parameters $\phi = \{m, S, z\}$ by maximisation of the evidence lower bound, $\mathcal{L}_{\text{ELBO}}$:
$$\mathcal{L}_{\text{ELBO}}(\theta, \phi) = \mathbb{E}_{q_{\theta,\phi}(f)}\left[\log p_\theta(y \mid f(x))\right] - \text{KL}\left[q_\phi(u) \,\|\, p_\theta(u)\right]. \quad (63)$$
Importantly, $\mathcal{L}_{\text{ELBO}}$ also serves as a lower bound to the log marginal likelihood $\log p_\theta(y \mid x)$, and so we can optimise this objective with respect to both $\theta$ and $\phi$. Since the likelihood factorises, we can obtain an unbiased estimate of $\mathcal{L}_{\text{ELBO}}$ by sampling batches of datapoints. This lends itself to stochastic optimisation using gradient-based methods, such as SGD. By replacing SGD with a differentially-private gradient-based optimisation routine (DP-SGD), we obtain our DP-SVGP baseline.

A difficulty in using DP-SGD to optimise the model and variational parameters of the DP-SVGP baseline is that test-time performance is a complex function of the hyperparameters of DP-SGD (i.e. maximum gradient norm, batch size, epochs, learning rate), the initial hyperparameters of the model (i.e. kernel hyperparameters and likelihood parameters), and the initial variational parameters (i.e. the number of inducing locations M). Fortunately, we are considering the meta-learning setting, in which we have available to us a number of datasets that we can use to tune these hyperparameters. We do so using Bayesian optimisation (BO) to maximise the sum of the $\mathcal{L}_{\text{ELBO}}$s over a number of datasets. To limit the number of parameters we optimise using BO, we set the initial variational mean and covariance to the prior mean and covariance, m = 0 and S = k(z, z). In Table S1, we provide the range for each hyperparameter that we optimise over. In all cases, we fix the number of datasets for which we compute the $\mathcal{L}_{\text{ELBO}}$ to 64 and the number of BO iterations to 200. We use Optuna [Akiba et al., 2019] to perform the BO, and Opacus [Yousefpour et al., 2021] to perform DP-SGD using the PRV privacy accountant.

Hyperparameter        Min     Max
Max gradient norm     1       20
Epochs                200     1000
Batch size            10      128
Learning rate         0.001   0.02
Lengthscale           0.1     2.5
Period                0.25    4.0
Scale                 0.5     2.0
Observation noise     0.05    0.25

Table S1: The ranges of DP-SGD hyperparameter settings (upper half) and initial model hyperparameters (lower half) over which Bayesian optimisation is performed for the DP-SVGP baseline.
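As an illustration of how such a search could be set up with Optuna, the sketch below defines a search space matching the ranges of Table S1; `fit_dp_svgp` and `tuning_datasets` are hypothetical placeholders for the DP-SGD training routine (e.g. via Opacus) and the 64 tuning datasets, not functions from the paper's code.

```python
import optuna

def make_search_space(trial):
    """DP-SGD and initial model hyperparameters, with the ranges of Table S1."""
    return {
        "max_grad_norm": trial.suggest_float("max_grad_norm", 1.0, 20.0),
        "epochs": trial.suggest_int("epochs", 200, 1000),
        "batch_size": trial.suggest_int("batch_size", 10, 128),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.02),
        "lengthscale": trial.suggest_float("lengthscale", 0.1, 2.5),
        "period": trial.suggest_float("period", 0.25, 4.0),
        "scale": trial.suggest_float("scale", 0.5, 2.0),
        "observation_noise": trial.suggest_float("observation_noise", 0.05, 0.25),
    }

def objective(trial):
    params = make_search_space(trial)
    # Placeholder: train one DP-SVGP per tuning dataset with DP-SGD and return the
    # sum of the resulting ELBOs. `fit_dp_svgp` and `tuning_datasets` are hypothetical.
    return sum(fit_dp_svgp(dataset, **params) for dataset in tuning_datasets)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)
```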
E Experiment details

In this section we give full details for our experiments. Specifically, we describe the generative processes we used for the Gaussian, non-Gaussian and sim-to-real tasks.

E.1 Synthetic tasks

First, we specify the general setup that is shared between the Gaussian and non-Gaussian tasks. Second, we specify the Gaussian and non-Gaussian generative processes used to generate the outputs. Lastly, we give details on the parameter settings for the amortised and the non-amortised models.

General setup. During training, we generate data by sampling the context set size N ∼ U[1, 512], then sampling N context inputs uniformly in [−2.00, 2.00] and 512 target inputs in the range [−6.00, 6.00]. We then sample the corresponding outputs using either the EQ Gaussian process or the sawtooth process, which we define below. For the DPConv CNP we use 6,553,600 such tasks with a batch size of 16 at training time, which is equivalent to 409,600 gradient update steps. We do note, however, that this large number of tasks, which was used to ensure convergence across all variants of the models we trained, is likely unnecessary, and significantly fewer tasks (fewer than half of what we used) would suffice. Throughout optimisation, we maintain a fixed set of 2,048 tasks, generated in the same way, as a validation set. Every 32,768 gradient updates (i.e. 200 times throughout the training process) we evaluate the model on these held-out tasks, maintaining a checkpoint of the best model encountered thus far. After training, this best model is the one we use for evaluation. At evaluation time, we fix N to each of the numbers specified in Figure 6, and sample N context inputs uniformly in [−2.00, 2.00] and 512 target inputs in the range [−2.00, 2.00]. We repeat this procedure for 512 separate tasks, and report the mean NLL together with its 95% confidence intervals in Figure 6. For all tasks, we set the privacy budget with δ = 10⁻³ and ϵ ∼ U[0.90, 4.00].

Gaussian generative process. For the Gaussian task, we generate the context and target outputs from a GP with the exponentiated quadratic (EQ) covariance, which is defined as
$$k(x, x') = \sigma_v^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right).$$

Sawtooth generative process. For the non-Gaussian task, we generate the context and target outputs from the truncation of the Fourier series of the sawtooth waveform, with terms of the form sin(2mπ(dx/τ) + ϕ), where d is a direction sampled uniformly from {−1, 1}, τ is the period, and ϕ ∈ [0, 2π] is a phase shift. In preliminary experiments, we found that the DPConv CNP worked well with the raw sawtooth signal (i.e. the full Fourier series), but the DP-SVGP struggled due to the discontinuities of the original signal, so we truncated the series, giving an advantage to the DP-SVGP.

Non-amortised and amortised tasks. For the non-amortised tasks, we train and evaluate a single model for a single setting of the generative parameter ℓ or τ. Specifically, for the Gaussian tasks, we fix ℓ = 0.50, 0.71 or 2.00 and train a separate model for each one, which is then tested on data with the same value of ℓ. Similarly, for the non-Gaussian tasks, we fix τ⁻¹ = 0.25, 0.50 or 1.00 and train a separate model for each one, which is again then tested on data with the same value of τ. For the amortised tasks, we sample the generative parameter ℓ or τ at random. Specifically, for the Gaussian tasks, we sample ℓ ∼ U[0.20, 2.50] for each task that we generate, and train a single model on these data. We then evaluate this model for each of the settings ℓ = 0.50, 0.71 or 2.00. Similarly, for the non-Gaussian tasks, we sample τ⁻¹ ∼ U[0.20, 1.25] for each task that we generate, and train a single model on these data. We then evaluate this model for each of the settings τ⁻¹ = 0.25, 0.50 or 1.00. The results of these procedures are summarised in Figure 6.
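For concreteness, the following is a minimal sketch (ours, for illustration) of how one Gaussian task could be generated following the general setup and EQ covariance above; the signal variance σ_v = 1 and the jitter term are assumptions we make, and all names are our own.

```python
import numpy as np

def sample_eq_gp_task(rng, n_max=512, n_target=512, lengthscale=None,
                      context_range=(-2.0, 2.0), target_range=(-6.0, 6.0), sigma_v=1.0):
    """Sample one synthetic Gaussian task roughly as described in Appendix E.1.

    The EQ kernel and the amortised lengthscale range follow the text; sigma_v
    and the jitter are illustrative assumptions.
    """
    if lengthscale is None:
        lengthscale = rng.uniform(0.20, 2.50)        # amortised setting
    n_context = rng.integers(1, n_max + 1)           # N ~ U[1, 512]
    x_context = rng.uniform(*context_range, size=n_context)
    x_target = rng.uniform(*target_range, size=n_target)
    x = np.concatenate([x_context, x_target])
    # Exponentiated quadratic (EQ) covariance.
    K = sigma_v ** 2 * np.exp(-0.5 * ((x[:, None] - x[None, :]) / lengthscale) ** 2)
    y = np.linalg.cholesky(K + 1e-8 * np.eye(len(x))) @ rng.standard_normal(len(x))
    return (x_context, y[:n_context]), (x_target, y[n_context:])

rng = np.random.default_rng(0)
(context_x, context_y), (target_x, target_y) = sample_eq_gp_task(rng)
```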
E.2 Sim-to-real tasks

For the sim-to-real tasks we follow a training procedure that is similar to that of the synthetic experiments. During training, we generate data by sampling the context set size N ∼ U[1, 512], then sampling N context inputs uniformly in [−1.00, 1.00] and 512 target inputs in the range [−1.00, 1.00]. We then sample the corresponding outputs from a GP with covariance
$$k(x, x') = k_{3/2, \ell}(x, x') + \sigma_n^2, \quad (64)$$
where $k_{3/2, \ell}$ is a Matern-3/2 covariance with lengthscale ℓ ∼ U[0.50, 2.00], and σ_n ∼ U[0.30, 0.80] is the noise standard deviation. For all tasks, we set the privacy budget at δ = 10⁻³ and ϵ ∼ U[0.90, 4.00]. The Dobe !Kung dataset is publicly available through TensorFlow [Abadi et al., 2016], specifically in the TensorFlow Datasets package. Note that we rescale the ages to be between −1.0 and 1.0 and normalise the heights and weights of users to have zero mean and unit standard deviation. We assume that the required statistics for these normalisations are public, but they could be released with additional privacy budget. Inaccurate normalisations would only increase the sim-to-real gap and reduce utility, but would not affect the privacy analysis. At evaluation time, we fix N to each of the numbers specified in Figure 7. We then sample N points at random from the normalised !Kung dataset and use the remaining points as the target set. We repeat this procedure for 512 separate tasks, and report the mean NLL together with its 95% confidence intervals in Figure 7.

E.3 Optimisation

For all our experiments with the DPConv CNP we use Adam with a learning rate of 3 × 10⁻⁴, setting all other options to the default TensorFlow 2 settings.

E.4 Compute details

We train the DPConv CNP on a single NVIDIA GeForce RTX 2080 Ti GPU, on a machine with 20 CPU workers. Meta-training requires approximately 5 hours, with synthetic data generated on the fly. Meta-testing is performed on the same infrastructure, and timings are reported in Figure 5.

F DPConv CNP architecture

Here we give the details of the DPConv CNP architecture used in our experiments. The DPConv CNP consists of a DPSet Conv encoder, and a CNN decoder followed by a Set Conv decoder. We specify the details for the parameters of these layers below.

DPSet Conv encoder and Set Conv decoder. For all our experiments, we initialise the DPSet Conv and Set Conv lengthscales (which are also used to sample the DP noise) to λ = 0.20, and allow this parameter to be optimised during training. For the learnable DP parameter mappings t(µ, N) = sig(NN_t(µ, N)) and C(µ, N) = exp(NN_C(µ, N)) we use simple fully connected feedforward networks with two layers of 32 hidden units each. For the discretisation step in the encoder, we use a resolution of 32 points per unit for all our experiments. We also use a fixed discretisation window of [−7, 7] for the synthetic tasks and [−2, 2] for the sim-to-real tasks. We did this for simplicity, although our implementation supports dynamically adaptive discretisation windows.

Decoder convolutional neural network. Most of the computation involved in the DPConv CNP happens in the CNN of the decoder. For this CNN we used a bare-bones implementation of a UNet with skip connections. This UNet consists of an initial convolution layer that processes the signal and density channels, along with two constant channels fixed to the magnitudes σ_s, σ_d of the DP noise used in these two channels, into another set of C_in channels. The result of the initial layer is then passed through the UNet backbone, which consists of N convolutional layers with a stride of 2 and output channels C = (C_1, . . ., C_N), followed by N transpose convolutions, again with a stride of 2 and with output channels (C_N, . . . , C_1). Before applying each of these convolution layers, we create a skip connection from the input of the convolution layer and concatenate this to the output of the corresponding transpose convolution layer. Finally, we pass the output of the UNet through a final transpose convolution with C_out = 2 output channels, which are then smoothed by the Set Conv decoder to obtain the interpolated mean and (log) standard deviation of the predictions at the target points. For all our experiments, we used C_in = 32, N = 7 and C_n = 256. We used a kernel size of 5 for all convolutions and transpose convolutions.
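The following is a minimal Keras sketch of a UNet along the lines described above (our illustration, not the authors' code); the exact placement of activations, the input length, and all names are assumptions made for illustration.

```python
import tensorflow as tf

def build_unet_decoder(n_levels=7, c_in=32, c_hidden=256, c_out=2,
                       kernel_size=5, length=1024):
    """Sketch of a 1D UNet decoder in the spirit of the description above.
    The 4 input channels are signal, density and two constant noise-magnitude
    channels; `length` should be divisible by 2**n_levels so the skips line up."""
    inputs = tf.keras.Input(shape=(length, 4))
    h = tf.keras.layers.Conv1D(c_in, kernel_size, padding="same")(inputs)  # initial layer
    skips = []
    for _ in range(n_levels):  # downsampling path: stride-2 convolutions
        skips.append(h)
        h = tf.keras.layers.Conv1D(c_hidden, kernel_size, strides=2,
                                   padding="same", activation="relu")(h)
    for skip in reversed(skips):  # upsampling path: stride-2 transpose convolutions
        h = tf.keras.layers.Conv1DTranspose(c_hidden, kernel_size, strides=2,
                                            padding="same", activation="relu")(h)
        h = tf.keras.layers.Concatenate()([h, skip])  # skip connection
    # Final transpose convolution producing the mean and log-std channels.
    outputs = tf.keras.layers.Conv1DTranspose(c_out, kernel_size, padding="same")(h)
    return tf.keras.Model(inputs, outputs)

unet = build_unet_decoder()  # 1024 grid points, divisible by 2**7
```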
[Figure S2 panels: rows N = 50, 100, 250, 500; columns ℓ = 0.25, 0.71, 2.00; x-axis from −4 to 4.]

Figure S2: Example model fits for the DPConv CNP on the EQ GP task. For all the above fits, a single amortised DPConv CNP is used, that is, a DPConv CNP that has been trained on EQ GP data with randomly chosen lengthscales ℓ ∼ U[0.20, 2.50] and random privacy budgets, specifically ϵ ∼ U[0.90, 4.00] with δ = 10⁻³ fixed. The first four rows correspond to ϵ = 1.00 and the last four to ϵ = 3.00. Note that column-wise the datasets are fixed, and we are varying the context set size N.

[Figure S3 panels: same layout as Figure S2.]

Figure S3: Same as Figure S2, but with a different dataset seed.

[Figure S4 panels: rows N = 50, 100, 250, 500; columns for three settings of the sawtooth parameter (1.00, 2.00, 4.00); x-axis from −4 to 4.]

Figure S4: Example model fits for the DPConv CNP on the sawtooth task. For all the above fits, a single amortised DPConv CNP is used, that is, a DPConv CNP that has been trained on sawtooth data with randomly chosen periods, τ⁻¹ ∼ U[0.20, 1.25], and random privacy budgets, specifically ϵ ∼ U[0.90, 4.00] with δ = 10⁻³ fixed. The first four rows correspond to ϵ = 1.00 and the last four to ϵ = 3.00. Note that column-wise the datasets are fixed, and we are varying the context set size N.
[Figure S5 panels: same layout as Figure S4.]

Figure S5: Same as Figure S4, but with a different dataset seed.

[Figure S6 panels: NLL against the number of context points N (0 to 500), for combinations of the generative parameter (0.71, 2.00), ϵ (0.33, 1.00) and δ (10⁻⁵, 10⁻³); legend: Oracle, DPConv CNP (non-amortised), DPConv CNP (amortised).]

Figure S6: Additional results using the DPConv CNP on the EQ and sawtooth synthetic tasks with stricter DP parameters, namely all combinations of ϵ ∈ {1/3, 1} and δ ∈ {10⁻⁵, 10⁻³}. The overall setup in this figure is identical to that in Figure 6, except the amortised DPConv CNP is trained on randomly chosen ϵ ∼ U[1/3, 1] and fixed δ = 10⁻⁵ or 10⁻³, and the non-amortised DPConv CNP models are trained on the ϵ and δ values indicated on the plots. Then, both amortised and non-amortised models are evaluated with the parameters shown on the plots. The DP-SVGP baseline was not run due to time constraints in the rebuttal period: it is significantly slower and more challenging to optimise than the DPConv CNP. We note that the amortisation gap, due to training a model to handle a continuous range of ϵ values, is negligible. We also note that as the number of context points N increases, the performance of the DPConv CNP approaches that of the oracle predictors.

[Figure S7 panels: rows N = 50, 100, 250, 500; columns for combinations of the generative parameter (0.71, 2.00), ϵ (0.33, 1.00) and δ (10⁻³, 10⁻⁵); x-axis from −4 to 4.]

Figure S7: Illustrations of model fits on the synthetic EQ and sawtooth tasks, using stricter DP parameters, for different context sizes N. Left: model fits of amortised DPConv CNPs trained on EQ data using ϵ ∼ U[1/3, 1] and fixed δ = 10⁻³ (first column) or δ = 10⁻⁵ (second column), evaluated on the DP parameters shown in the plots. Right: same as the left plot, except the data generating process is the sawtooth waveform rather than an EQ Gaussian process. We observe that the DPConv CNP produces sensible predictions even under strict privacy settings.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The main claims of the paper, summarised at the end of the introduction, accurately reflect the contributions of the paper made in Sections 4 and 5.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Limitations are discussed in Section 6.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All novel theorems include either the full proof, or a proof idea with a reference to the full proof in the Appendix.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Yes, we give all necessary details in the main text and Appendix E, and provide our implementations in the supplementary material.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: The code is included in the supplementary material, and the Dobe !Kung dataset [Howell, 2009] is freely available, e.g. through TensorFlow Datasets [Abadi et al., 2016].
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Yes, we give all necessary details in the main text and Appendix E.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Error bars are reported when appropriate, and documented in the figure captions.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: Yes, these details are provided in Appendix E.4.
9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: The paper conforms to the Code of Ethics.
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: Broader impacts are discussed in Section 6.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper does not pose such risks.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models) used in the paper properly credited, and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: The license for the Dobe !Kung dataset [Howell, 2009] is mentioned in the bibliography.
13. New Assets
Question: Are new assets introduced in the paper well documented, and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: Documentation is included in the code.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not include crowdsourcing experiments or research with human subjects.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not include crowdsourcing experiments or research with human subjects.