# kernelbased_evaluation_of_conditional_biological_sequence_models__999f5912.pdf

Kernel-Based Evaluation of Conditional Biological Sequence Models

Pierre Glaser 1 Steffanie Paul 2 Alissa M. Hummer 2 3 Charlotte M. Deane 3 Debora S. Marks 4 Alan N. Amin 5

We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model s estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, Protein MPNN. We are able to reject the hypothesis that Protein MPNN fits its data for various protein families, and tune the model s temperature hyperparameter to achieve a better fit.

1. Introduction

Conditional sequence models constitute one of the most prominent model classes of modern machine learning. Such models have allowed progress in longstanding problems in fields ranging from natural language generation to biomedical applications such as genomics and protein design. Abstracting away the precise nature of the data, the objective common to many of these problems can be summarized as the prediction of high-dimensional discrete-valued sequences, given some possibly high-dimensional input information about the sequence. For example, in protein design, inverse folding models (Dauparas et al., 2022) seek to learn the conditional distribution of amino acid sequences (proteins) that are likely to fold to a given input protein backbone 3D geometry, or structure.

1Gatsby Computational Neuroscience Unit, London, UK 2Systems Biology, Harvard Medical School, Boston, USA 3Department of Statistics, University of Oxford, Oxford, UK 4Harvard Medical School, Broad Institute, Boston, USA 5Courant Institute, New York University, New York, USA. Correspondence to: Pierre Glaser <pierreglaser@gmail.com>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

In such problems, it is crucial to evaluate the properties of the trained model. Model evaluation can help assess the risk of using the model s predictions in the real world, such as performing in-vitro experiments (a time-intensive process), guide hyperparameter searches, and deepen one s understanding of the model s behavior. Two properties are particularly important to measure: the first one is model accuracy, which describes how well the model approximates the true conditional distribution of the target variable given the input. Models with high accuracy have learned the underlying structure of the data, suggesting a high potential value in deploying them in real-world applications. However, in practice, it is likely that models will not be perfectly accurate. Inaccurate models can still be useful as long as they fall back to conservative guesses (in the extreme case, the prior distribution) when they are uncertain. From a statistical perspective, this property is known as reliability (Br ocker, 2008; Vaicenavicius et al., 2019; Widmann et al., 2021), and will be the second property of interest in this work.

Given a set of real samples, the standard approach to evaluate models in protein design consists in using loglikelihoods or sequence recovery (Dauparas et al., 2022; Hsu et al., 2022; Gao et al., 2022). However, log-likelihoods cannot be used to evaluate reliability, and are only relative measures of accuracy: these methods can only be used to compare models and would not alert the practitioner for example if all models make very poor predictions. Instead, to assess how far a model is from being optimally accurate and consistent and thus the potential value in improving it, by for example collecting more data or increasing its complexity, one should consider absolute rather than relative metrics, that is, metrics that not only allow one to compare models to each other, but also to evaluate a single model s performance without any other point of comparison. For these metrics to have practical value, they should come with estimators computable from data samples. These estimators should be efficiently computable, recover the true metric as in the large sample size limit (i.e. be consistent), and preferably be centered around the true value of the metric (i.e. be unbiased). Finally, to factor out the statistical error coming from estimating these metrics using a finite number of samples, these metrics should be integrable into hypothesis tests built to detect statistically significant mismatches

Kernel-Based Evaluation of Conditional Biological Sequence Models

between the model and the data.

Contributions In this work, we introduce a set of absolute evaluation metrics for measuring the accuracy and the reliability of conditional sequence models. Both our metrics are grounded in a new measure of divergence between conditional probability distributions, which we call the Augmented Conditional Maximum Mean Discrepancy (ACMMD), which extends the kernel-based conditional goodness-of-fit framework of Jitkrittum et al. (2020); Glaser et al. (2023); Widmann et al. (2021) to the case of sequencevalued variables. We analyze the statistical properties of our proposed metrics, which can be estimated using samples from the data and the model. Under certain conditions, we show that the ACMMD is able to detect any mismatch between the model and the data. In addition, we integrate the ACMMD into hypothesis tests to detect such mismatches from the model and the data samples. We showcase the utility of our methods by using them in an in-depth analysis of a popular inverse folding model - Protein MPNN (Dauparas et al., 2022). Our results demonstrate the theoretical properties of our methods, while also providing insight as to how to gauge the certainty and applicability of Protein MPNN for designing proteins of varying topologies and evolutionary families.

2. Problem Setting

We consider the problem of predicting a discrete sequencevalued variable we are designing Y Y, for example a biological sequence, conditionally on a variable X X at our disposal. The predicted sequence Y is allowed to have an arbitrary length, e.g. Y = ℓ=1Aℓ, where A is a finite set. In protein design, X could be the 3D structure of a protein (e.g. X = ℓ=1R3ℓ) and Y the sequence of amino acids making up the protein, in which case A is the set of amino acids. Given a large number of i.i.d measurements {Xi, Yi}NT i=1 from a distribution P(X, Y ), for example pairs of sequences and structures from the Protein Data Bank (Ingraham et al., 2019), we train a predictive model Q| : x 7 Q|x that takes in a value x and outputs a distribution on Y , Q|x(Y ) that attempts to match the true conditional P(Y |X = x), denoted P|x(Y ) in this work. After training, we are interested in quantifying how accurately Q| approximates P| on average across all values of x after training, using a held-out set of samples {Xi, Yi}N i=1 P(X, Y ). Quantifying the accuracy of Q| is known as the conditional goodness-of-fit problem, and we address it in Section 3. Furthermore, we will also be interested in quantifying the reliability of Q|, a task which we address in Section 4.

3. Conditional Goodness of Fit with ACMMD

In this section, we propose a metric that quantifies the accuracy of a predictive sequence model. We will show that this metric satisfies many desirable properties: first, it is absolute and able to detect any differences between conditional distributions. Second, it can be unbiasedly and efficiently estimated using samples from the model and the data distribution. Third, it can be used in hypothesis tests to detect statistically significant mismatches from such samples.

3.1. The Augmented Conditional MMD

We now propose a method to quantitatively evaluate the conditional goodness of fit of Q| to P|. Our approach consists in constructing a divergence D(P|, Q|), between the conditional distribution of Y given X and the model Q|. By definition, this divergence should satisfy:

(i) D(P|, Q|) 0

(ii) D(P|, Q|) = 0 P|x = Q|x, P(X) a.e. (1)

Combined, these two properties ensure that D(P|, Q|) is absolute, e.g. assigns the known value lowest value 0 to the best possible model, and is able to distinguish any mismatch between the model and the data, which is crucial to prevent blind spots in our evaluation. We borrow the idea of comparing Q| with P| by comparing the joint P(X, Y ) with a joint that keeps the same marginal P(X) but swaps P| with Q|. These two joint distributions are equal if and only if Q| and P| match almost everywhere. To compare these two distributions, we will use the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) given by:

MMD(Q1, Q2) = sup f HZ f HZ 1

EQ1[f(Z)] EQ2[f(Z)].

(2) Here, Z is some measurable space, Q1 and Q2 are probability measures on Z, and HZ is a reproducing kernel Hilbert space (RKHS) of functions from Z to R with kernel k Z (Berlinet & Thomas-Agnan, 2011). Applying this general definition to the case at hand, we obtain a measure of accuracy for Q|, defined below. Definition 3.1 (Augmented Conditional MMD). Let (X, Y ) X Y with law PX P|. Let Q| be a conditional probability from X to Y. We define the Augmented Conditional MMD (ACMMD) between P| and Q| as:

ACMMD(P|, Q|) := MMD(PX P|, PX Q|) (3)

where the MMD is evaluated with a user-specified kernel k X Y on X Y. Here, PX P| is defined by (X, Y ) PX P| X PX, (Y |X = x) P|x, and similarly for PX Q|.

Choice of kernel for ACMMD The ACMMD requires specifying a kernel on the joint space X Y. In this work,

Kernel-Based Evaluation of Conditional Biological Sequence Models

we will focus on the case where k X Y is the tensor product kernel k X k Y of two kernels k X and k Y on X and Y respectively:

k X Y((x, y), (x , y )) = k X (x, x )k Y(y, y ) (4)

This choice is popular in practice, and the resulting ACMMD retains its desirable properties, as we show next.

The ACMMD is a divergence between conditional probabilities The ACMMD writes as divergence (which is symmetric, e.g. a distance) between joint distributions, while we seek to use it to compare conditional distributions. The following lemma shows that the same ACMMD can be formulated in alternative manner that highlights its purpose as a conditional distribution comparator.

Lemma 3.2. Under mild integrability conditions, we have:

ACMMD(P|, Q|) = TKX (µP| µQ|) HX,HY

Where µP| and µQ| are the conditional mean embeddings (Park & Muandet, 2020) of P| and Q|, KX (x, x ) := k X (x, x )IHY (here, IHY the identity operator) is an operator-valued kernel with associated vector-valued RKHS HX,HY L2 PX(X, HY), and TKX is its associated integral operator from L2 PX(X, HY) to HX,HY. Moreover, if k X and k Y are C0-universal 1, then it holds that:

ACMMD(P|, Q|) = 0 P|x = Q|x, PX-a.e.

The complete statement (with the full set of assumptions, and the definition of integral operators) and its proof can be found in Appendix A. Lemma 3.2 shows that the ACMMD can be understood as the result of a two-step procedure, given by (1) computing the conditional mean embedding µP| : x 7 EP|x [k Y(y, )|X = x] of P| (resp. of Q|), which is a function from X to HY, and (2) embed the difference of these conditional mean embeddings into the vector-valued RKHS HX,HY L2 PX(X, HY) with kernel KX , before returning its associated RKHS norm. The second part of the lemma gives sufficient conditions for the ACMMD to discriminate between any non (PX a.e) equal conditional distributions, fulfilling the requirements specified in Equation (1): these conditions are to use universal kernels k X and k Y. Regarding k Y, this requirement is not very restrictive, as many universal kernels on sequences have been shown to be universal (Amin et al., 2023a). The difficulty in finding a universal k X will depend on the space X (unspecified in this work) for the problem at hand.

1A kernel k is C0-universal if the associated RKHS Hk is dense in C0(X), the space of continuous functions on X vanishing at infinity (Sriperumbudur et al., 2010)

Estimating the ACMMD from data Crucial to this work is the fact that if the model Q| can be sampled from for any x X, ACMMD 2 will admit tractable unbiased estimators. To see this, we first rewrite ACMMD 2 in a form that will make this property apparent.

Lemma 3.3. Let Z := (X, Y, Y ) the triplet of random variables with law 2 PX P| Q|. Then, under the integrability assumptions of Lemma 3.2, we have that:

ACMMD2(P|, Q|) = EZ1,Z2 [h(Z1, Z2)]

where Z1, Z2 are two independent copies of Z and h is a symmetric function given by:

h(Z1, Z2) := k X (X1, X2)g((Y1, Y1), (Y2, Y2))

g((Y1, Y1), (Y2, Y2)) := k Y( Y1, Y2) + k Y(Y1, Y2)

k Y( Y1, Y2) k Y(Y1, Y2)

Lemma 3.3, proved in Appendix B.2, expresses ACMMD 2

as a double expectation given two independent samples of (X, Y, Y ) PX P| Q|. Leveraging this fact, we can derive an unbiased and consistent estimator for ACMMD 2.

Lemma 3.4. Let {Xi, Yi, Yi}N i=1 i.i.d PX P| Q| be samples from the data and the model. Then an unbiased estimator \ ACMMD2(P|, Q|) of ACMMD2(P|, Q|) is given by:

1 i<j N h((Xi, Yi, Yi), (Xj, Yj, Yj)) (5)

Lemma 3.4, proved in Appendix B.2, shows that it is possible to unbiasedly estimate ACMMD2 even when the analytical model expectations are intractable, provided that one can sample from the model. This estimator takes the form of a U-statistics (Serfling, 2009, Chapter 5) with symmetric probability kernel h, which are well-studied in the statistics literature. In particular, they provide a generic framework to obtain minimal-variance analogues of unbiased estimators (Serfling, 2009, Chapter 5, p. 176). \ ACMMD2 is a consistent estimator of ACMMD2: under the integrability assumptions of Lemma 3.2 the strong law of large numbers applies (Serfling, 2009, Section 5.4, Theorem A), and we have: \ ACMMD2(P|, Q|) a.s. N ACMMD2(P|, Q|). We

provide a more detailed characterization of the asymptotic distribution of \ ACMMD2 in Appendix B.

3.2. Testing Conditional Goodness of Fit with ACMMD

In the limit of infinitely many samples, a positive ACMMD means that the model and the data differ. However, in

2Identifying Q| with its analogue Markov kernel Q| from (X Y, X Y ) such that Q|(x,y)(dy ) := Q|x(dy ).

Kernel-Based Evaluation of Conditional Biological Sequence Models

practice, when only a finite number of samples are available, our estimate \ ACMMD2 is only a noisy version of the true ACMMD2, meaning we cannot conclude whether the model fits the data by directly inspecting its value. Instead, we need a procedure that accounts for the estimation noise; we achieve this by using the ACMMD as part of a hypothesis test deciding between two different hypotheses: ( H0 : ACMMD(P|, Q|) = 0 H1 : ACMMD(P|, Q|) > 0

In particular, we construct a test that takes as input a sample {Xi, Yi, Yi}N i=1 from the data and the model and outputs a (binary) decision to reject (or not) the null hypothesis H0 based on whether \ ACMMD2(P|, Q|) exceeds a certain threshold. Because of the estimation noise arising from the use of finitely many samples, such a test cannot systematically output the right decision. Nonetheless, we build our test to ensure a false rejection (e.g. reject H0 while P| = Q| a.e) rate of α (0, 1), a common practice in statistical testing (Gretton et al., 2012). To do so, we would like to set the rejection threshold q1 α to be an estimate of the 1 α quantile of the distribution of \ ACMMD2(P|, Q|) under H0. However, since q1 α is not available in closed form, we instead compute an estimate bq1 α using the wild bootstrap procedure (Arcones & Gin e, 1992). This procedure draws B samples { ACMMD2 b}B b=1 of the form:

ACMMD2 b := 2 N(N 1)

1 i<j N W b i W b j h(Zi, Zj) (6)

where {W b i }b=1...B i=1...N are i.i.d. Rademacher random variables independent of the data, from which we compute a quantile estimate bq1 α of this distribution of samples (see Appendix C for a precise definition of bq1 α). Importantly, this procedure guarantees an exact control of the false rejection rate at level α. We prove this fact in Appendix C.3, where we cast the wild bootstrap procedure as a Monte-Carlo estimation of the distribution of \ ACMMD2 when P| = Q|, which is valid non-asymptotically. Our test, which we call the ACMMD test, is summarized in Algorithm 1. To the best of our knowledge, this is the first conditional goodness-of-fit test that is applicable to sequence models.

4. Assessing Reliability with ACMMD

In practice, our model Q| may not fit the data perfectly, and it is important to distinguish (at a given level of inaccuracy) models that remain consistent with their training data from ones that fail more drastically. In this section, we show how the ACMMD can be used to evaluate model reliability, a statistical property capturing model and data consistency.

Problem Setting A model Q| is said to be reliable (Br ocker, 2008; Vaicenavicius et al., 2019; Widmann et al., 2021) if

Algorithm 1 ACMMD Conditional Goodness of fit Test

Input: {Xi, Yi, Yi}N i=1 i.i.d. PX P| Q| Parameters: Level α, kernel k X , kernel k Y // Estimate ACMMD using Equation (5)

\ ACMMD2 2 N(N 1)

h((Xi, Yi, Yi), (Xj, Yj, Yj))

Sample { ACMMD2 b}B b=1 using Equation (6)

bq1 α approx. (1 α)-quantile of { ACMMD2 b}B b=1 if \ ACMMD2 bq1 α then

Fail to reject H0 else

Reject H0 end if

the distribution of the target Y given that the model made a specific prediction q is this prediction q itself, e.g. if:

q = P Y | Q|X = q P(Q|X) a.e. (7)

Here, Q|X P(Y) (the space of probability distributions on Y) is the random variable obtained by evaluating the model Q| at a random value of the input variable X. Reliability differs from accuracy in that it does not require the model to learn all the information between X and Y , but only to make truthful predictions on average thus, by assessing reliability, one may be able to detect models that hallucinate non-realistic sequences (such as repeats of the same token) in regions of the input space where they are inaccurate, instead of making a conservative guess, such as falling back to the prior disitribution. In particular, reliability can be used as an additional criterion to discriminate between models that are equally accurate. From a theoretical perspective, reliability and accuracy can be handled in a unified manner: indeed, Equation 7 shows that reliability is defined as an equality between the conditional distribution of Y given a model prediction q, PQ |q := P(Y = |Q|X = q) and a model of this conditional distribution mapping q P(Y) to itself, e.g. QRel | : q 7 QRel |q = q. We thus propose to measure reliability using the ACMMD (a distance between conditional distributions) between QRel | and PQ | .

Definition 4.1 (ACMMD for Reliability). The Augmented Conditional MMD for reliability (ACMMD Rel) between P| and Q| as:

ACMMD Rel(P|, Q|):= ACMMD(PQ | , QRel | )

= MMD(P|Q|X PQ | , P|Q|X Q|) (8) where the ACMMD is evaluated with a user-specified kernel k P(Y) Y on P(Y) Y.

As for the ACMMD, we will restrict our attention to the case where k P(Y) Y is a tensor product kernel k P(Y) k Y

Kernel-Based Evaluation of Conditional Biological Sequence Models

between a kernel on P(Y) and a kernel on Y. Comparing the ACMMD Rel with the ACMMD, we see that the former requires specifying a kernel on the space of probability measures on sequences P(Y) instead of a kernel on X. Two important points must be addressed when working with such kernels. First, in order to have ACMMD Rel(P|, Q|) = 0 if and only if Q| is reliable, we must find universal kernels defined on P(Y). Second, as many kernels on probabilities are intractable, we must design an approximation strategy to estimate the ACMMD Rel from data.

ACMMD Rel can detect any pattern of unreliability Our first goal is to ensure that ACMMD Rel can detect any pattern of unreliability. As ACMMD Rel is a specific instance of the ACMMD, we can apply Lemma 3.2, stating that if the kernels k P(Y) and k Y are universal, then

ACMMD Rel(P|, Q|) = 0 Q| is reliable. (9)

The task of finding a universal kernel k Y on Y was addressed in Section 3; thus, it remains to find a universal kernel k P(Y) on P(Y). However, to the best of our knowledge, none of the existing kernels defined on probability distributions (Carmeli et al., 2010; Szab o et al., 2015; 2016; Meunier et al., 2022; Glaser et al., 2023) have been shown to be universal when Y is the space of arbitrary-length sequences. In the next proposition, we show that many such kernels can be constructed by following a simple recipe.

Proposition 4.2. Let k Y be a kernel on Y vanishing at infinity (on Y Y). Suppose that k Y has discrete masses, i.e. that δy HY for all sequences y Y, where δy is the Dirac function at y, and let σ > 0. Then the kernel k P(Y) on P(Y) defined as

k P(Y)(q, q ) := e 1 2σ2 MMD2(q,q ), (10)

(where the MMD is computed in HY) is a C0 universal kernel on the space of probability distributions P(Y) (under the topology of convergence in distribution or Total Variation, which are identical, see (Amin et al., 2021)).

The proof, provided in Appendix D.2, relies on an argument similar to prior work for universal kernels on probability measures (Carmeli et al., 2010), but tailored to the special case of sequences. Proposition 4.2 guarantees that any kernel on Y vanishing at infinity with the discrete mass property (Amin et al., 2023a) can be used to construct a universal kernel on P(Y). Kernels with discrete masses are studied in detail in (Amin et al., 2023a). In particular, the tilted Exponentiated Hamming Kernel 1 |y||y |e λd H(y,y ) (where |y| is the length of the sequence y) is a kernel with discrete masses vanishing at infinity on Y Y, and can thus be used to construct a universal kernel on P(Y).

Estimating ACMMD Rel from data To estimate ACMMD Rel from the data {Xi, Yi}N i=1 and samples from the model { Yi Q|Xi}N i=1, one may try to use the general ACMMD estimator proposed in Lemma 3.4, which, specialized to the reliability setting, is given by:

1 i<j N h(Zi, Zj)

h(Zi, Zj) := k P(Y)(Q|Xi, Q|Xj)g((Yi, Yi), (Yj, Yj))

This estimator requires evaluating k P(Y)(Q|Xi, Q|Xj) for pairs i, j. Unfortunately, exact evaluation of these quantities for the universal kernels proposed in Proposition 4.2 is in general impossible, as MMD2(Q|Xi, Q|Xj) contains intractable expectations under Q|Xi and Q|Xj. However, MMDs can be unbiasedly estimated using samples from Q|Xi and Q|Xj (Gretton et al., 2012; Schrab et al., 2022). Inspired by this fact, we propose the following estimator:

\ ACMMD Rel 2 := 2 N(N 1)

1 i<j N bh(Zi, Zj)

bh(Zi, Zj) := bkij g((Yi, Yi), (Yj, Yj))

Here, bkij is an approximation of k P(Y)(Q|Xi, Q|Xj) obtained by drawing R samples { Y r i }R r=1 and { Y r j }R r=1 from Q|Xi and Q|Xj, and replacing the MMD2(Q|Xi, Q|Xj)

term in k P by an unbiased estimate \ MMD2 ij computed from these samples. The full estimation procedure is provided in Algorithm 2. This additional approximation step has several implications: first, unlike \ ACMMD2, \ ACMMD Rel 2 is not unbiased. However, the bias of this estimator can be controlled by increasing the number of samples R used to estimate the MMD. Moreover, we show in the next proposition that the estimator \ ACMMD Rel 2 is still consistent provided that R is chosen appropriately. Proposition 4.3. Assume that k Y is bounded. Then, if R R(N), with lim N R(N) = + , \ ACMMD Rel 2

converges in probability to ACMMD Rel 2 as N .

Testing for reliability with ACMMD Rel As an ACMMD, ACMMD Rel has the potential to be used to test whether a model is reliable given some available data: to do so, one can use Algorithm 1, replacing \ ACMMD2 by \ ACMMD Rel 2, and performing quantile estimation using the bh(Zi, Zj) instead of the h(Zi, Zj). A full description of the algorithm is provided in Appendix D.3.1. An important question to answer is whether the approximation of using bh instead of h affects the false-rejection rate of the test. We show in the next proposition that this is not the case. Proposition 4.4. Assume that k Y is bounded, and k P(Y) is a kernel of the form of Equation 10. Then a reliability test using bh(Zi, Zj) instead of h(Zi, Zj) to estimate

Kernel-Based Evaluation of Conditional Biological Sequence Models

Algorithm 2 Estimating ACMMD Rel

Input: {Xi, Yi, Yi}N i=1 i.i.d. PX P| Q|, model Q| Parameters: kernel k Y for i in 1 to N do

[ Y r i Q|Xi for r in 1 to R ] end for for i, j in 1 to N do

%Use, e.g. Gretton et al. (2012, Equation 4)

\ MMD 2 ij := estimate mmd({ Y r i }R r=1, { Y r j }R r=1) bhij := e 1 2σ2 \ MMD 2 ij g((Yi, Yi), (Yj, Yj)) end for

return 2 N(N 1)

1 i<j N bhij

ACMMD Rel and its (1 α) quantile under H0 has a false-rejection rate of exactly α.

5. Related Work

Goodness-of-fit methods The goodness-of-fit problem is a well-studied problem in the statistics and machine learning literature, for which many methods were developed (Chwialkowski et al., 2016; Gorham & Mackey, 2017; Grathwohl et al., 2020; Amin et al., 2023b; Baum et al., 2023). Impressively, these methods can operate directly from the model s analytical form, without requiring access to samples from the model which may be hard to generate. In these works, goodness-of-fit is defined as the problem of evaluating the fit of unconditional models to their data, which is unlike the conditional goodness-of-fit problem we consider here. Evaluating conditional goodness-of-fit with kernels was recently studied in Jitkrittum et al. (2020). However, the proposed method requires the output space Y to be a subset of Rd, and is thus unsuitable for conditional sequence models. The use of conditional goodness-of-fit metrics to evaluate reliability was also done in Glaser et al. (2023), in a method also limited to continuous output spaces. Finally, we note that ACMMD Rel 2 recovers an existing calibration metric, the Squared Kernel Calibration Error (SKCE) of Widmann et al. (2021). However, the latter did not study the problems of universality, tractability and test validity in the case of sequence-valued outputs.

Deep Protein Design Models (Deep Learning powered) conditional probability models have gained significant momentum in computational biology during the last decade. In particular, such models have revolutionized the protein design field (Johnson et al., 2023; Bennett et al., 2023). Inverse folding models are trained on protein structures and sequences in the protein data bank (PDB) (Ingraham et al., 2019; Hsu et al., 2022; Dauparas et al., 2022). They con-

dition a sequence distribution on an input protein structure thus learning what sequences would likely fold into that structure. The designs from these methods have been shown to be highly stable and retain function (Sumida et al., 2024). However, many of the leaps made using these models have used small, simple structural scaffolds (like loop-helix-loop motifs) (Bennett et al., 2023; Watson et al., 2023). Protein engineers interested in leveraging these tools for novel scaffolds need to know how accurate and reliable the model is on average. If the model is too imprecise, one might wish to gather more data and train more bespoke models before using the method to design experiments.

6. Experiments

We now investigate the behavior and utility of the ACMMD and ACMMD Rel metrics and tests in practice. We start with a synthetic example showing that ACMMD is a natural measure of model distance. We then perform an extended analysis of a state-of-the-art inverse folding model, Protein MPNN. We show that ACMMD can detect small perturbations in the model, and that it can be used to tune its temperature parameter. Finally, we analyze the absolute performance of Protein MPNN.

6.1. A toy synthetic setting

We first study the behavior of ACMMD and ACMMD Rel in a synthetic setting where the data distribution and the model are simple generative models on sequences. We set the input variable X to be a single scalar p drawn from some distribution PX with support in (0.3, 0.5). Y is a sequence of arbitrary length with alphabet A = {A, B, STOP}. We set the conditional distribution of Y given p to be:

p(yn|y0:n 1, x = p) =

A with probability p B with probability p STOP with probability 1 2p

so long as yn 1 = STOP. The model distribution Q| is the same as the data, except for the fact that the first factor Q|p(y0) is perturbed by a parameter p:

A with probability p p B with probability p + p STOP with probability 1 2p

We set the kernel on Y to be the exponentiated Hamming distance kernel k Y(y, y ) = e d H(y,y ), where d H(y, y ) is the Hamming distance between y and y , and k X to be the Gaussian kernel k X (p, p ) = e 1

2 (p p )2. With such choices, it is possible to show that:

ACMMD(P|, Q|) = C| p|

Kernel-Based Evaluation of Conditional Biological Sequence Models

for some C > 0 that does not depend on p, and is computable in closed form for discrete priors on X. The proof and expression of C are given in Appendix D.4. From this expression, we immediately see that ACMMD is 0 only if p = 0, a manifestation of Lemma 3.2 which guarantees that the ACMMD can detect any mismatch between the model and the data when k X and k Y are universal. Moreover, in this case, the ACMMD depends monotonically on the shift | p|. Since | p| represents a natural measure of how different the model is from the data, this fact suggests that the ACMMD is a natural, well-behaved measure of model distance. Additionally, we plot the average rejection rate of the ACMMD test for various number of samples and shifts in Figure 1. The results for p = 0 confirm that our test has the correct specified type-I error rate (0.05). Moreover, we see that the power of the test increases with the number of samples, and the shift | p|.

0 500 1000 Num. Samples

ACMMD Estimates

ACMMD2 \ ACMMD2

0 500 1000 Num. Samples

ACMMD Test: Rejection Rate

α : 0.05 p : 0.0

p : 0.05 p : 0.07

p : 0.1 p : 0.2

Figure 1. Left panel: ACMMD estimates for a fixed shift value p = 0.25 and various number of samples in the synthetic example of Section 6.1. The analytic ACMMD value is given by the horizontal line. Right panel: ACMMD test average rejection rate for various number of samples and shifts in the same setting.

6.2. ACMMD Case Study: Inverse Folding Models

To demonstrate the utility of the ACMMD measures and tests, we apply them to evaluate inverse folding models, a popular model framework used in protein design. Inverse folding models seek a distribution of amino acid sequences that are likely to fold into a given input three-dimensional structure, as discussed in Section 3. We focus our experiments on evaluating Protein MPNN (Dauparas et al., 2022), a sampleable, commonly used model in this class. The sampling temperature T of Protein MPNN can also be varied, letting the user control the trade-off between accuracy and diversity of the generated sequences.

Data We leveraged the CATH taxonomy to select a set of diverse (in sequence and structural topologies) protein structures to perform our ACMMD test on. CATH is a taxonomy of protein structures that categorizes proteins according to a hierarchy of structural organization (Sillitoe et al., 2021). We used the S60 redundancy filtered set which includes

proteins that are at least 60% different in sequence identity from each other. Of these, we selected all single domain monomers (proteins where only one topological domain is found in the monomer), and removed any topologies that had fewer than 10 chains in its classification. This left us with 17,540 structures.

Choice of kernel Key to the performance of our metrics is the choice of the kernels k X , k Y and k P(Y). For k Y, we propose to use kernels that first embed each element or residue of a sequence y using an embedding function ϕY : A Y 7 Rd Y, and evaluating a euclidean kernel on Rd Y on the mean of the resulting embeddings, yielding a kernel of the form:

k Y(y, y ) = k Rd Y

i=1 ϕY(yi, y), 1

i=1 ϕY(y i, y )

where we noted y = (y1, . . . , y|y|). As the input space X is also sequence-valued, we follow the same recipe to construct our a kernel k X , using an embedding function ϕX : R3 X 7 Rd X . Finally, for the kernel on P(Y), we will use a kernel of the form of Equation (10), with kernel k Y described above to compute the inner MMD. We set our embedding functions ϕX and ϕY to a pair of recent pre-trained neural networks that are commonly used for representation learning of protein sequences and structures: Gearnet (Zhang et al., 2023) for structures, and ESM-2 (Lin et al., 2023) for sequences. Such two-step kernels allow us to instill the complex structure present in the distribution of protein structures and sequences within the ACMMD maximizing the performance and meaningfulness of our evaluation pipeline. Whether Proposition 4.2 holds for these kernels is an open question, but we find that they perform well in our experiments.

250 500 750 1000 Num. Samples

250 500 750 1000 Num. Samples

Rejection rate

0.0 0.01 0.05 0.1

Figure 2. Values of \ ACMMD2, (left) and of the average rejection rate of the ACMMD test (right) in the setting described in 6.2.1. Each line corresponds to a different value for δT.

6.2.1. THE DISCRIMINATIVE POWER OF ACMMD

We first propose to evaluate the behavior of the ACMMD and its associated test when comparing a known ground

Kernel-Based Evaluation of Conditional Biological Sequence Models

truth and a model distribution differing from the ground truth in a controlled manner. To this end, we set the ground truth to be a pre-trained Protein MPNN model QT | with temperature T, and the model to be the same model QT +δT | with temperature T + δT. As Protein MPNN s probability distribution is a continuous function of T, small changes in T result in small changes in the predicted distribution which will be hard to detect, translating into low values for \ ACMMD2 relative to larger temperature changes. Conversely, we posit that large changes in T will result in large changes in the model distribution, and will be simpler to detect by the ACMMD. To test these hypotheses, we performed an estimation of ACMMD 2 for a ground truth temperature T = 0.1 (the default in the Protein MPNN documentation) and δT {0, 0.01, 0.05, 0.1}. We used the winged helix-like DNA binding domain superfamily (CATH ID: 1.10.10.10), and performed bootstrap sampling to produce dataset sizes ranging from 100 to 1000, and 100 different random seeds in order to obtain confidence intervals of our estimates.

250 500 750 1000 Num. Samples

\ ACMMD Rel2

250 500 750 1000 Num. Samples

Rejection rate

0.0 0.01 0.05 0.1

Figure 3. Values of \ ACMMD Rel 2, (left) and of the average rejection rate of the ACMMD Rel test (right) in the setting described in Section 6.2.1. Each line corresponds to a different value for δT.

The results are shown in Figure 2. As expected, ACMMD 2(QT | , QT +δT | ) robustly increases with increasing values of δT. Additionally, we performed the ACMMD test of Section 3.2 with a target type-I error rate of α = 0.05, and 100 permutations to estimate the 1 α quantile of the null distribution for the same values of N and δT, and computed the average rejection rate of the null hypothesis H0 : ACMMD2(QT | , QT +δT | ) > 0, which, if k X and k Y are universal, is equivalent to H0 : δT = 0. The results, shown in Figure 2 (right), empirically confirm that the ACMMD test controls its type-I error rate and is able to detect differences in temperatures of an order relevant for Protein MPNN. Similarly, we evaluate the behavior of the ACMMD Rel, which is used to assess the (lack of) reliability of between model QT +δT | w.r.t the data P|X QT | , the

assumption being that QT +δT | is not reliable when δT = 0.

The results are shown in Figure 3, and exhibits similar behavior.

6.2.2. EVALUATION OF PROTEINMPNN ON THE CATH

Now that we have confirmed the discriminative power of the ACMMD on semi-synthetic data, we use our tests to evaluate Protein MPNN against real-world protein structures and sequences from the CATH dataset. We perform a wholedata evaluation, using samples of 5000 proteins across all families in the dataset. Then we perform a fine-grain evaluation on a subset of CATH superfamilies.

Whole-data Evaluation We first study the deviation of Protein MPNN from the true data by computing \ ACMMD2

and estimating its mean and variance by bootstrapping over 10 random seeds. We find that Protein MPNN with no temperature adjustment (T = 1.0) has an \ ACMMD2 value of 0.0916 (and a p-value < 0.01). Comparing this to the criterion values obtained on similar dataset sizes in the toy data experiments demonstrates that the model does not fit the test data. This suggests that there is still much room for improvement on solving the inverse folding problem.

On optimal temperature choices for Protein MPNN Practitioners vary the sampling temperature as a heuristic method for sampling more certain sequences from Protein MPNN; lower temperature settings have been found to generate sequences with fewer unrealistic artifacts (e.g. runs of alanines) which fold to more stable structures (Sumida et al., 2024). However, the relationship between sampling temperature, model reliability, and design accuracy has not been fully established. To thoroughly evaluate this, designs from different sampling temperatures conditioned on a diverse set of backbone structures would need to be experimentally characterized, which is resourse intensive in practice. We leverage the ACMMD to understand at what sampling temperature Protein MPNN best fits the data, which gives insight as to what temperature is optimum, by computing \ ACMMD2 and \ ACMMD Rel 2 for varying temperature values across 10 seeds for each temperature value. The results are shown in Figure 4.

First, we observe that reducing the temperature below 1.0 improves both the model s goodness-of-fit and its reliability. This corroborates the empirical design success of lowering the sampling temperature, suggesting that greater model fit may increase the quality of samples from the model. The decrease in reliability at higher temperature shows that even though increasing the temperature increases the diversity of the model s predictions, this diversity does not necessarily capture the one of the data distribution, as for instance the prior would. The optimal temperature from the perspective of goodness-of-fit is 0.4 (which lies outside the suggested

Kernel-Based Evaluation of Conditional Biological Sequence Models

temperature range of 0.1-0.3 in the Protein MPNN documentation (Dauparas et al., 2022)). However, we notice that model reliability continues to improve with even lower sampling temperatures while accuracy slightly increases, suggesting a trade-off between reliability and accuracy. Further experiments will determine how this trade-off manifests in the quality of designs from low-temperature settings.

10 2 10 1 100 Temperature

10 4 10 3 10 2 10 1 Temperature

\ ACMMD Rel 2

Figure 4. Evolution of \ ACMMD2 (left) and \ ACMMD Rel 2

(right) between a pre-trained Protein MPNN model and the CATH S60 reference dataset, for varying temperature values

Structural superfamily evaluation The (H)omologous superfamily tier within the CAT(H) hierarchy groups proteins with the same similar folds and sequence identity. While Protein MPNN has shown great performance in designing particular structural scaffolds, a practitioner aiming to leverage this model on a yet untested structural family may want some insight as to how well Protein MPNN may fit the distribution of proteins they are interested in. Thus, we performed ACMMD evaluation separately on individual superfamilies contained in our dataset to gain insights on what types of structures Protein MPNN does or does not fit well. We filtered the superfamilies for groupings with at least 500 proteins under a length of 100, yielding 11 families. The results are shown in Figure 5. We find that the model fit varies across families and the fit ranking is largely maintained at different temperatures. With no temperature adjustment (T = 1.0) the best fit superfamily (lowest \ ACMMD2) is the Homeodomain-like proteins (CATH ID: 1.10.10.60). These structures are largely dominated by helical bundles - a class of proteins that Protein MPNN has demonstrated success on designing (Dauparas et al., 2022; Watson et al., 2023; Bennett et al., 2023). While the Immunoglobulin superfamily has the highest fit at lower sampling temperatures, we note that most of an immunoglobulin structure consists of the beta sandwich of the framework, while, for antibody design, engineers are often most interested in the unstructured complementarity determining regions (CDRs) of antibodies (Kunik & Ofran, 2013; Liu et al., 2020; Jin et al., 2022). As the criterion is calculated across the entire sequence, this may not reflect that Protein MPNN has learned the distribution of CDR loops well. Further work will extend these tests to focus on subsequences of a domain to answer specific questions of model fit on regions of interest.

Homeodomain-like

PI 3-kinase

Acid Proteases

SH3 domains

WHL DNA binding

Immunoglobulins

Superfamily

Figure 5. Value of \ ACMMD2 between Protein MPNN and the CATH S60 reference dataset on a subset of 10 superfamilies for two different temperatures T = 1.0 and T = 0.1.

7. Discussion

Advancing the computational evaluation of conditional sequence models is crucial for accelerating the development of these methods for protein engineering. Given the limitations of current evaluation methods, and leveraging recent advancements in kernel methods for designing tests of goodness-of-fit and calibration, we propose a criterion and its associated test to principledly evaluate protein sequence models for how well they have learned input-conditioned sequence distributions. We discuss the statistical properties of our metrics and develop testing frameworks from them. Finally, we leverage them to investigate the performance of inverse folding models under default and temperatureadjusted settings. We develop novel insights on ideal temperature settings for Protein MPNN and discuss the trade-off between design accuracy and model calibration that our tests demonstrate for lower temperatures. Future work can perform a more fine-grained evaluation, for example investigating which structures in particular cause the model to make unreliable predictions and what features of the model s predictions do not match the data through the use of witness functions, a by-product of MMDs (Lloyd & Ghahramani, 2015). We also note that protein engineering goals may differ from pure modeling goals, and whether performance under our metrics reflect experimental design success rates requires further investigation to determine. Yet, barring orthogonal in silico validation data or experimental testing, our methods offer a powerful framework to test conditional sequence models for desirable statistical properties.

Impact Statement

The tools developed in this work assess the quality of sequences predictors. As such, they have the potential to influence various procedures in protein design, and, on longer timescales, healthcare. However, the conclusions that they

Kernel-Based Evaluation of Conditional Biological Sequence Models

provide are only statistical: while they are guaranteed to hold on average, they will not hold every time. Such tools should thus be used with caution, and in conjunction with external help from domain experts to ensure that the real-world actions they will influence remain beneficial to society.

Amin, A., Weinstein, E. N., and Marks, D. A generative nonparametric bayesian model for whole genomes. Neur IPS, 34:27798 27812, 2021.

Amin, A. N., Weinstein, E. N., and Marks, D. S. Biological sequence kernels with guaranteed flexibility. ar Xiv preprint ar Xiv:2304.03775, 2023a.

Amin, A. N., Weinstein, E. N., and Marks, D. S. A kernelized stein discrepancy for biological sequences. In Proceedings of the 40th ICML, 2023b.

Arcones, M. A. and Gin e, E. On the bootstrap of U and V statistics. The Annals of Statistics, 1992.

Baum, J., Kanagawa, H., and Gretton, A. A kernel stein test of goodness of fit for sequential models. In ICML, 2023.

Bennett, N. R., Coventry, B., Goreshnik, I., Huang, B., Allen, A., Vafeados, D., Peng, Y. P., Dauparas, J., Baek, M., Stewart, L., Di Maio, F., De Munck, S., Savvides, S. N., and Baker, D. Improving de novo protein binder design with deep learning. Nat. Commun., 2023.

Berlinet, A. and Thomas-Agnan, C. Reproducing kernel Hilbert spaces in probability and statistics. 2011.

Br ocker, J. Some remarks on the reliability of categorical probability forecasts. Monthly Weather Review, 2008.

Carmeli, C., De Vito, E., Toigo, A., and Umanit a, V. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 2010.

Christmann, A. and Steinwart, I. Universal kernels on Non Standard input spaces. In Neur IPS, 2010.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In ICML, 2016.

Cuturi, M. and Blondel, M. Soft-dtw: a differentiable loss function for time-series. In ICML, 2017.

Cuturi, M., Vert, J.-P., Birkenes, O., and Matsui, T. A kernel for time series based on global alignments. In ICASSP, 2007.

Dauparas, J., Anishchenko, I., Bennett, N., Bai, H., Ragotte, R. J., Milles, L. F., Wicky, B. I. M., Courbet, A., de Haas, R. J., Bethel, N., Leung, P. J. Y., Huddy, T. F., Pellock, S., Tischer, D., Chan, F., Koepnick, B., Nguyen, H., Kang,

A., Sankaran, B., Bera, A. K., King, N. P., and Baker, D. Robust deep learning-based protein sequence design using Protein MPNN. Science, 2022.

Dinculeanu, N. Vector integration and stochastic integration in Banach spaces. John Wiley & Sons, 2000.

Domingo-Enrich, C., Dwivedi, R., and Mackey, L. Compress then test: Powerful kernel testing in near-linear time. AISTATS, 2023.

Gao, Z., Tan, C., Chac on, P., and Li, S. Z. Pi Fold: Toward effective and efficient protein inverse folding. ar Xiv preprint ar Xiv:2209.12643, 2022.

Glaser, P., Widmann, D., Lindsten, F., and Gretton, A. Fast and scalable score-based kernel calibration tests. In UAI, 2023.

Gorham, J. and Mackey, L. Measuring sample quality with kernels. In ICML, 2017.

Grathwohl, W., Wang, K.-C., Jacobsen, J.-H., Duvenaud, D., and Zemel, R. Learning the stein discrepancy for training and evaluating energy-based models without sampling. In ICML, 2020.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch olkopf, B., and Smola, A. A kernel two-sample test. JMLR, 2012.

Hoeffding, W. On sequences of sums of independent random vectors. Springer, 1994.

Hsu, C., Verkuil, R., Liu, J., Lin, Z., Hie, B., Sercu, T., Lerer, A., and Rives, A. Learning inverse folding from millions of predicted structures. ICML, 2022.

Ingraham, J., Garg, V., Barzilay, R., and Jaakkola, T. Generative models for graph-based protein design. Neur IPS, 2019.

Jin, W., Barzilay, R., and Jaakkola, T. Antibody-Antigen docking and design via hierarchical equivariant refinement. ICML, 2022.

Jitkrittum, W., Kanagawa, H., and Sch olkopf, B. Testing goodness of fit of conditional density models with kernels. In UAI, 2020.

Johnson, S. R., Fu, X., Viknander, S., Goldin, C., Monaco, S., Zelezniak, A., and Yang, K. K. Computational scoring and experimental evaluation of enzymes generated by neural networks. 2023.

Kriege, N. M., Johansson, F. D., and Morris, C. A survey on graph kernels. Applied Network Science, 2020.

Kunik, V. and Ofran, Y. The indistinguishability of epitopes from protein surface is explained by the distinct binding preferences of each of the six antigen-binding loops. Protein Engineering, Design & Selection, 2013.

Kernel-Based Evaluation of Conditional Biological Sequence Models

Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 2023.

Liu, G., Zeng, H., Mueller, J., Carter, B., Wang, Z., Schilz, J., Horny, G., Birnbaum, M. E., Ewert, S., and Gifford, D. K. Antibody complementarity determining region design using high-capacity machine learning. Bioinformatics, 2020.

Lloyd, J. R. and Ghahramani, Z. Statistical model criticism using kernel two sample tests. Adv. Neural Inf. Process. Syst., 2015-Janua:829 837, 2015.

Meunier, D., Pontil, M., and Ciliberto, C. Distribution regression with sliced Wasserstein kernels. In ICML, 2022.

Park, J. and Muandet, K. A measure-theoretic approach to kernel conditional mean embeddings. Neur IPS, 2020.

Saigo, H., Vert, J.-P., Ueda, N., and Akutsu, T. Protein homology detection using string alignment kernels. Bioinformatics, 2004.

Schrab, A., Kim, I., Guedj, B., and Gretton, A. Efficient aggregated kernel tests using incomplete u-statistics. Advances in Neural Information Processing Systems, 35: 18793 18807, 2022.

Serfling, R. J. Approximation theorems of mathematical statistics. John Wiley & Sons, 2009.

Sillitoe, I., Bordin, N., Dawson, N., Waman, V. P., Ashford, P., Scholes, H. M., Pang, C. S. M., Woodridge, L., Rauer, C., Sen, N., Abbasian, M., Le Cornu, S., Lam, S. D., Berka, K., Varekova, I. H., Svobodova, R., Lees, J., and Orengo, C. A. CATH: increased structural coverage of functional space. Nucleic acids research, 2021.

Sriperumbudur, B., Fukumizu, K., and Lanckriet, G. On the relation between universality, characteristic kernels and rkhs embedding of measures. In AISTATS, 2010.

Sumida, K. H., N u nez-Franco, R., Kalvet, I., Pellock, S. J., Wicky, B. I. M., Milles, L. F., Dauparas, J., Wang, J., Kipnis, Y., Jameson, N., Kang, A., De La Cruz, J., Sankaran, B., Bera, A. K., Jim enez-Os es, G., and Baker, D. Improving protein expression, stability, and function with Protein MPNN. Journal of the American Chemical Society, 2024.

Szab o, Z., Gretton, A., P oczos, B., and Sriperumbudur, B. K. Two-stage sampled learning theory on distributions. In AISTATS, 2015.

Szab o, Z., Sriperumbudur, B. K., P oczos, B., and Gretton, A. Learning theory for distribution regression. JMLR, 2016.

Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J., and Sch on, T. Evaluating model calibration in classification. In AISTATS, 2019.

Vert, J.-P., Saigo, H., and Akutsu, T. Local alignment kernels for biological sequences. Kernel methods in computational biology, 2004.

Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., Wicky, B. I. M., Hanikel, N., Pellock, S. J., Courbet, A., Sheffler, W., Wang, J., Venkatesh, P., Sappington, I., Torres, S. V., Lauko, A., De Bortoli, V., Mathieu, E., Ovchinnikov, S., Barzilay, R., Jaakkola, T. S., Di Maio, F., Baek, M., and Baker, D. De novo design of protein structure and function with RFdiffusion. Nature, 2023.

Widmann, D., Lindsten, F., and Zachariah, D. Calibration tests beyond classification. In ICLR, 2021.

Zhang, Z., Xu, M., Chenthamarakshan, V., Lozano, A., Das, P., and Tang, J. Enhancing protein language models with structure-based encoder and pre-training. ar Xiv preprint ar Xiv:2303.06275, 2023.

Kernel-Based Evaluation of Conditional Biological Sequence Models

Supplementary Material of the paper Kernel-Based Evaluation of Conditional Biological Sequence Models

A. Proof of Lemma 3.2

Let us first re-state the lemma in its complete form.

Lemma (Complete form of Lemma 3.2). Assume that X is locally-compact and second countable. Moreover, assume that k X Y = k X k Y, and that k X , k Y satisfy the integrability conditions E [k X (X, X)k Y(Y, Y )] < + and E [k Y(Y, Y )] < + (and similarly for Y ). Then,

ACMMD2(P|, Q|) = TKX (µP| µQ|) 2

Where µP| and µQ| are the conditional mean embeddings (Park & Muandet, 2020) of P| and Q|, given by: µP| : x 7 Ey P|xk Y(y, ) (and similarly for Q|), KX (x, x ) := k X (x, x )IHY is an operator-valued kernel with associated vectorvalued RKHS HX,HY L2 PX(X, HY), and TKX is its associated integral operator from L2 PX(X, HX,HY) to HX,HY, defined as

TKX f(x) = Z

X KX (x, x )f(x )PX(dx ) HY

for all f L2 PX(X, HY) and x X. Moreover, if k X and k Y are C0-universal 3

ACMMD(P|, Q|) = 0 µP|x = µQ|x, PX-a.e.

Proof. Let us introduce the notations used in this proof. Let (Ω, F, P) be the sample space, X(ω), Y (ω), Y (ω) being random variables on Ωcorresponding to the input, target and the model. When clear, we will identify the measure P and the push-forwards Y#P, Y#P and drop the dependence of Y, Y on ω. Given x X, we write Kx the linear operator from HY to L(X, HY), the space of linear operators from X to HY, such that (Kxf)(x ) = KX (x, x )f HY for all f HY. When no confusion is possible, we may identify the notations k Y(y, ) and ky.

The existence of the conditional mean embeddings µP| and µQ| is guaranteed by (Park & Muandet, 2020, Definition 3.1) under the integrability assumption R k Y(y, y)d P(y) < + and R k Y( y, y)d P( y) < + . The second integrability assumption R k X (x, x)k Y(y, y)d(PX P|)(x, y) = R k X k Y((x, y), (x, y))d(PX P|)(x, y) < + guarantees the existence of the mean embedding µPX P|, defined as:

Z k X (x, ) k Y(y, )d(PX P|)(x, y) HX HY

by (Gretton et al., 2012, Lemma 3) (and respectively for Q|). Here, HX HY is the tensor product Hilbert space of HX and HY, with kernel k X k Y. We actually prove a stronger form of the lemma, given by removing the norm from both hands of the equality and replacing it with a suitable isometric isomorphism ϕ : HX HY 7 HX,HY

ϕ( Z k X (x, ) k Y(y, )d(PX P|)(x, y) = Z KxµP|d PX(x)

This isometric isomorphism is shown to exist in the Currying lemma of Carmeli et al. (2010, Example 6) regarding tensor product kernels (note that both X by assumption and Y are locally compact and second-countable). This lemma shows that the mapping: ϕ : HX HY F(X, HY)

f g 7 ϕ(f g) = (x X 7 f(x)g HY)

is an isometric isomorphism between HX HY and HX,HY. This lemma gives both a representation formula for elements of HX,HY, and a way to formalize the currying operation, (e.g. the transformation of a function of two variables into a higher-order function of one variable and returning a function of one variable) on tensor-product spaces, since

3A kernel k is C0-universal if the associated RKHS Hk is dense in C0(X), the space of continuous functions on X vanishing at infinity (Sriperumbudur et al., 2010)

Kernel-Based Evaluation of Conditional Biological Sequence Models

(f g)(x, y) = (ϕ(f g)(x))(y). We refer to Carmeli et al. (2010, Example 6) for a proof. Proceeding with the proof of Lemma 3.2, when f and g are kernel functions k X (x, ) and k Y(y, ), the right-hand side of the equality can be related to Kx as ϕ(k X (x, ) k Y(y, ))(x ) = k X (x, x )k Y(y, )

= K x Kxk Y(y, )

where the second to last equality follows from the reproducing property of HX,HY. Since ϕ is linear and unitary, it commutes with the mean embedding operation: (Dinculeanu, 2000, Theorem 36), yielding:

ϕ( Z (k X (x, ) k Y(y, ))d(PX P|)(x, y)) = Z ϕ(k X (x, ) k Y(y, ))d(PX P|)(x, y)

= Z Kxkyd(PX P|)(x, y)

To complete the proof, it remains to relate the right-hand side to the conditional mean embedding µP|, using Z Kxkyd(PX P|)(x, y) = ZZ Kxkyd PX(x)d P|x(y)

Z kyd P|x(y)d PX(x)

= Z KxµP|(x)d PX(x)

as Kx is a bounded linear operator. We thus have that:

ϕ( Z k X (x, ) k Y(y, )d(P|X P|)(x, y) = Z KxµP|d PX(x)

Combining this with the analogue of this result holding for µQ| allows to show the stronger form of Lemma 3.2. Let us now prove the second part of the lemma. Assume ACMMD(P|, Q|) = 0, meaning

TKX (µP| µQ|) = 0

By Carmeli et al. (2010, Theorem 2) KX is a C0-universal operator-valued kernel, the operator TKX is injective. This implies that the conditional mean embeddings of P|x and Q|x are equal PX almost everywhere. By Park & Muandet (2020, Theorem 5.2) applied to the case where the marginals are equal, and since k X k Y is C0 universal, this implies that P|x = Q|x, PX almost everywhere, and in summary, ACMMD(P|, Q|) = 0 implies P|x = Q|x, PX almost everywhere. To prove the reverse direction, assume that P|x = Q|x, PX almost everywhere Since Park & Muandet (2020, Theorem 5.2) also prove the reverse direction of the statement relied upon in the previous argument, we have that conversely µP|(x) = µQ|(x), PX almost everywhere. By linearity of TKX , we thus have that TKX (µP| µQ|) = 0, and therefore ACMMD(P|, Q|) = 0.

B. Asymptotic distribution of \ ACMMD2

As discussed in the main text, it is possible to characterize the asymptotic distribution of N \ ACMMD2. when P| = Q|, and

N( \ ACMMD2 ACMMD2) when P| = Q|. This characterization is given in the next lemma. Lemma B.1. Assume that the integrability assumptions of Lemma 3.2 hold, and that EZ1,Z2h(Z1, Z2)2 < + , and that

if P|x = Q|x P(X) a.s, then E[ \ ACMMD] = 0 and

N \ ACMMD2 d

j=1 λj(χ2 1j 1)

where {χ2 1j} j=1 are independent random χ2 1 variables, and λj are the eigenvalues of the operator defined as:

ϕ 7 Z h(z, )ϕ(z)d P(z)

Kernel-Based Evaluation of Conditional Biological Sequence Models

Assume moreover that k X and k Y are C0-universal kernels, and that σ2 h = 4VZ2[EZ1h(Z1, Z2)] > 0. Then

N( \ ACMMD2 ACMMD 2]) d N(0, σ2 h)

Proof. Since EZ1,Z2h(Z1, Z2)2 < + we have that VZ1,Z2EZ1,Z2h(Z1, Z2)2 < + . Let us define, as in Serfling (2009, Section 5.1.5), the function h(z) = EZ2h(z, Z2), and define ζ := Vzh. For the first point, we will show that if P| = Q|, P a.s, then ζ = 0, and the result will follow from Serfling (2009, Setion 5.5.2). Indeed, noting k = k X k Y,

h(z) = EZ2h(z, Z2)

= EZ2 D k((x, y), ) k((x, y ), ), k((X2, Y2), ) k(X2, Y2) E

= D k((x, y), ), k((x, y ), ), Ez2 h k((X2, Y2), ) k((X2, Y2), ) i E

Where we exchanged the order of integration and inner product, which is possible since h 7 k((x, y), ) k((x, y), ), h is a bounded linear functional for all (x, y, y). Now,

Ez2k((X2, Y2), ) k((X2, Y2), ) = EPX h EP|k((X2, Y2), ) EQ|k((X2, Y2), ) i = 0

since P|x = Q|x PX a.s. Thus, h(z) is a constant function, and ζ = 0. The second case follows by assumption from Serfling (2009, Section 5.1.1).

B.1. Proof of Lemma 3.3

Let HX Y := HX HY be the tensor-product RKHS of functions from X Y with kernel k X k Y. The result can be obtained by applying a coupling argument, and starting from the following object:

µPX P| PX Q| := Z (k((x(ω), y(ω)), ) k((x(ω), y(ω)), )) d P(ω)

= Ex,y, y [k((x, y), ) k((x, y), )] (12)

We first show that µPX P| PX Q| is a well-defined element of HX Y. Indeed, the following operator

T : f H 7 Ezf((x, y)) f((x, y))

satisfies |Tf| E [|f(x, y)| + |f(x, y)|]

f HX Y (E p

k((x, y), (x, y))

k((x, y), (x, y)))

and is bounded thanks to the integrability assumptions of Lemma 3.2. Applying the same argument as (Gretton et al., 2012, Lemma 3), it follows that the object in Equation (12) is well-defined and belongs to HX Y. Furthermore, by linearity of integration, we have that: Z (k((x(ω), y(ω)), ) k((x(ω), y(ω)), )) d P(ω)

= Z k((x(ω), y(ω)), )d P(ω) Z k((x(ω ), y(ω )), )d P(ω )

= µPX P| µPX Q|

To conclude, note that:

ACMMD(P|, Q|)2 = µPX P| PX Q| 2

HX Y = µPX P| PX Q|, µPX P| PX Q|

HX Y = Ex,y, y [k((x, y), ) k((x, y), )] , Ex,y, y [k((x, y), ) k((x, y), )] HX Y = Ex1,y1, y1Ex2,y2, y2h((x1, y1, y1), (x2, y2, y2))

Kernel-Based Evaluation of Conditional Biological Sequence Models

Where the last equality was obtained by exchanging the order of integration and dot product, possible thanks to the integrability assumptions of Lemma 3.2, by using the bilinearity of the inner product and the reproducing property of the kernel k. The symmetry of h in (Z1, Z2) follows from the symmetry of k X k Y.

B.2. Proof of Lemma 3.4

Proof. The proof of the unbiasedness of \ ACMMD2 follows by linearity of the expectation, and that each h((Xi, Yi, Yi), (Xj, Yj, Yj)) is an unbiased estimator of ACMMD 2(P|, Q|).

C. Type-I error control of the ACMMD test

The goal of this section is to show that the ACMMD test is guaranteed to control its type-I error rate at level α.

C.1. Quantile estimation and Decision Rule

We first fully specify the way we compute our quantile estimate bq1 α. Let bα := (1 α)(B + 1) . Given B bootstrap samples { ACMMD2 b}B b=1 and an \ ACMMD2 estimate, we order them in increasing order in a sequence of size B + 1, with ties broken arbitrarily. Let m = min{b [[1, B + 1]] | ACMMD2 b = ACMMD2 bα}, and M = max{b [[1, B +

1]] | ACMMD2 b = ACMMD2 bα}. We set bq1 α to be the (m 1)-th element with probability (bα (1 α)(B + 1))/(M m + 1) (with the convention that the 0-th element is ), and the bα-th element otherwise The decision rule is then to reject the null hypothesis if \ ACMMD2 > q1 α.

C.2. Wild-bootstrap and permutation-based approaches are equivalent in the ACMMD test

To show that the ACMMD test is guaranteed to control its type-I error rate at level α, we show that the use of a wild bootstrap procedure in the ACMMD test can be cast as a computationally efficient way to approximate the quantiles of the random variable \ ACMMD2 when P|x = Q|x PX a.e.

Lemma C.1. Let {Zi}N i=1 be i.i.d realizations of PX P| Q|, and let {W b i }b=1...B i=1...N be i.i.d. Rademacher random variables independent of the data. Given a function σ : [[1, N]] 7 { 1, 1}, define {Zσ i }N i=1 := {Xi, Y σ i , Y σ i }N i=1, where (Y σ i , Y σ i ) = (Yi, Yi) if σ(i) = 1, and ( Yi, Yi) otherwise. Then we have:

ACMMD2 b = 2 N(N 1)

h(Zσb i , Zσb j ) := \ ACMMD2 σb

for σb(i) := W b i .

The W b i should be understood as elements of a random swap σb, which for each i, swaps Yi and Yi with probability 1/2.

Proof. Without loss of generality, we fix i = 1 and j = 2, and fix b, dropping the b index. Note that h(Z1, Z2) and h(Zσ 1 , Zσ 2 ) share the same k X (X1, X2). The only differing term is

g((Y1, Y1), (Y2, Y2)) := k Y(Y1, Y2)) + k Y( Y1, Y2) k Y(Y1, Y2) k Y( Y 1, Y 2)

and we only need to show that W1W2g((Y1, Y1), (Y2, Y2)) = g((Y σ 1 , Y σ 1 ), (Y σ 2 , Y σ 2 )).

Case W 1 = W 2 = 1 In that case, Z1 = Zσ 1 and Z2 = Zσ 2 , and W1W2h(Z1, Z2) = h(Z1, Z2) = h(Zσ 1 , Zσ 2 ). by definition of σ.

Case W1 = W2 = 1 In that case, we have:

g((Y σ 1 , Y σ 1 ), (Y σ 2 , Y σ 2 )) = k( Y1, Y2) + k(Y1, Y2)k( Y1, Y2) k(Y1, Y2)) = h(Z1, Z2)

implying again W1W2h(Z1, Z2) = h(Z1, Z2) = h(Zσ 1 , Zσ 2 ).

Kernel-Based Evaluation of Conditional Biological Sequence Models

Case W1 = 1 and W2 = 1 In that case, we have:

h(Zσ 1 , Zσ 2 ) = k((X1, Y1), (X2, Y2)) + k((X1, Y1), (X2, Y2)) k((X1, Y1), (X2, Y2)) k((X1, Y1), (X2, Y2))

= h(Z1, Z2)

meaning again W1W2h(Z1, Z2) = h(Z1, Z2) = h(Zσ 1 , Zσ 2 ), and the last case is proved similarly.

C.3. Level of the ACMMD test

We now show that the ACMMD test has the desired type-I error rate.

Lemma C.2. Assume that P|x = Q|x PX a.s. Then the probability that the ACMMD test rejects the null hypothesis is exactly α.

The proof consists in 2 steps. First, we show that the decision rule is equivalent to a simpler one. Then, we analyze the latter decision rule.

An equivalent decision rule This decision rule is equivalent to the one rejecting H0 if the position Q (with ties broken uniformly at random) of \ ACMMD2 in that sequence satisfies Q > bα, accepting it if Q < bα, and rejecting it with probability bα (1 α)(B + 1) if Q = bα: Indeed, Q > M \ ACMMD2 > q1 α (we always reject), Q < m \ ACMMD2 q1 α (we never reject), and for both rules, when the random position Q is in [[m, M]], the null is rejected with probability (bα α(B + 1))/(M m + 1).

Analysis of the decision rule We derive the type-I error of our decision rule by analyzing the equivalent, latter one. Our analysis follows a similar argument, in flavor, as Domingo-Enrich et al. (2023, Appendix C). Now, recall that from Lemma C.1, the wild bootstrap quantile estimation are draws of \ ACMMD on swapped samples Zσ, e.g. {Xi, Y σ i , Y σ i }N i=1 parameterized by σ : [[1, N]] 7 { 1, 1} where Y σ i = Yi if σ(i) := wi and Y σ i = Yi otherwise:

( ACMMD2 b)B b=1 = ( \ ACMMD2 σb)B b=1

using the notation of Lemma C.1. Note that σ is a random swap operator such that σ(i) = 1 with probability 0.5, and σ(i) = 1 with probability 0.5. If P|x = Q|x a.e., then since the B swap maps σ1, . . . , σB are i.i.d. let us note \ ACMMD2 σ0 = \ ACMMD2, e.g. σ0(i) = 1. Then the random sequence ( \ ACMMD2 σb)B b=0 is exchangeable. Since Q is the

position of \ ACMMD2 within that sorted sequence, and that all positions are equally likely under exchangeability, we have:

P [Q < m] = 1/(B + 1)

P [Q > bα] = (B + 1 bα) /(B + 1)

P [Q < bα] = (bα 1) /(B + 1)

Noting ((Xi, Y i, Y i)N i=1) the event that the null hypothesis is rejected, we have:

P h ((Xi, Y i, Y i)N i=1) i = P [Q > bα] + P [Q = bα] P[Reject|Q = bα]

= (B + 1 bα) /(B + 1) + (bα (1 α)(B + 1)) /(B + 1) = α,

thus showing that the ACMMD test has the desired type-I error rate.

D. Proofs related to ACMMD Rel

D.1. Differences between the SKCE U-statistics and the ACMMD U-statistic

We recall the definition of the SKCE U-statistics estimator from (Widmann et al., 2021, Lemma 2):

\ SKCE = 2 N(N 1)

1 i<j N G((Q|Xi, Yi), (Q|Xj, Yj)) (13)

Kernel-Based Evaluation of Conditional Biological Sequence Models

G((q, y), (q , y )) := k P(Y) Y((q, y), (q , y )) EY qk P(Y) Y((q, Y ), (q , y ))

EY q k P(Y) Y((q, y), (q , Y )) + EY q EY q k P(Y) Y((q, Y ), (q , Y )).

= k P(Y)(q, q ) (k Y(y, y ) EY qk Y(Y, y ) EY q k Y(y, Y ) + EY q EY q k Y(Y, Y )) (14)

Where the second equality holds when focusing on tensor product kernels. Comparing Equation (13) and Equation (14) with the expression of the ACMMD U-statistics estimator given in Equation (5) and Lemma 3.3, we see that the SKCE population criterion equals the ACMMD. However, the SKCE U-statistics estimator is different from the ACMMD U-statistics estimator: while the ACMMD U-statistics only requires samples the conditional distributions Q|X, the SKCE U-statistics contains expectations over the conditional distributions Q|X, which are rarely available in practice.

D.2. Proof of Proposition 4.2

Proof. We will show that the image of q 7 µq is compact, and the result will follow from (Christmann & Steinwart, 2010). Let M(Y) the Banach space of measures of sequences endowed with the total variation norm:

q TV := q+(Y) + q (Y)

We recall that by the Riesz-Markov theorem, (M(Y), TV) can be identified with the topological dual of C0(Y), (C0(S) , op) through an isometric isomorphism q M(Y) 7 q C0(Y) , and for which the following holds:

Y fdq, f C0(Y).

Let B := {q M(Y) | q TV 1}. As a unit ball, by the Banach-Alaoglu theorem, B is compact under the weak topology and contains all distributions on sequences. We will show that q 7 µq is continuous on B and the result will follow. Given that this mapping is linear, it is sufficient to show continuity at 0. Moreover, since {B(0H, r)}r>0 is a neighborhood basis of (H, H), it suffices to show that there is a neighborhood V of the null measure in the weaktopology such that R k Y(y, )dq(y) 2 H = R k Y(y, y )d(q q)(y, y ) < 1 for all q in V. Since the family

{q M(Y), Z fi(x)dq(x) < ϵ, i 1, . . . k, fi C0(Y)}

form a neighborhood basis of the weaktopology, we can consider candidates of this form for V. In particular, let us set {fi} = {x 7 p

k Y(x, x) C0(Y)}, since k Y C0(Y Y), and ϵ = 0.5. On this neighborhood, we have: Z k(y, y )dq(y)dq(y ) Z p

k(y, y) k(y , y )dq(y)dq(y ) 0.52 < 1, q V

showing the continuity of the map in question. As a consequence, the image of BM(S)(0, 1) by the map q 7 µq is compact, implying from (Christmann & Steinwart, 2010) that the kernel

k(f, g) := exp( 1

2σ2 f g 2 H)

is universal on that set. Thus, we have shown that k is universal on H under the strong topology (e.g. the norm topology in H). This is equivalent to the TV topology of P(S) since k has discrete masses by proposition 9 of (Amin et al., 2023a), and thus k P(Y) is universal on P(Y).

D.3. Proofs regarding the impact of approximate kernels

To prove the convergence of the ACMMD Rel estimator and the validity of its test, we rely on an augmented U-statistics formulation. Let: U := (Q|X, Y 1, . . . , Y R, Y , Y ) PQ|X Q r | Q| PQ | := U

U is the random variable which, for each model Q|X, concatenates the synthetic samples ( Y 1, . . . , Y R) used to perform the kernel approximation bk P(Y)(q, q ), the synthetic sample Y used to evaluate h, and Y , a sample from PQ | the conditional

Kernel-Based Evaluation of Conditional Biological Sequence Models

Algorithm 3 ACMMD Rel Conditional Goodness of fit Test

Input: {Xi, Yi, Yi}N i=1 i.i.d. PX P| Q| Parameters: Level α, kernel k X , kernel k Y // Estimate ACMMD Rel using Algorithm 2 and collect the bh(Zi, Zj) of Equation (11)

\ ACMMD Rel 2, {h(Zi, Zj)}1 i<j N estimate acmmd rel({Xi, Yi, Yi}N i=1, Q|Xi) [W b i Rademacher for i 1, . . . , N for b 1, . . . , B] [ ACMMD2 b 2 N(N 1) P

1 i<j N W b i W b j bh(Zi, Zj), for b in 1, . . . , B]

// See Appendix C.1 for how to compute bq1 α bq1 α approx. (1 α)-quantile of { ACMMD2 b}B b=1 if \ ACMMD2 bq1 α then

Fail to reject H0 else

Reject H0 end if

distribution of Y given Q|X. Then, given N realizations {Ui}N i=1, of U, the estimator \ ACMMD Rel 2 can be written as a U-statistics on the {Ui}N i=1 \ ACMMD Rel 2 = 2 N(N 1)

1 i<j N ha(Ui, Uj)

where ha(Ui, Uj) := bk({ Y r i }R r=1, { Y r j }R r=1) (k Y(Yi, Yj) + k Y( Yi, Yj) k Y(Yi, Yj) k Y( Yi, Yj))

We also will note ACMMD 2 a := EU1,U2 U U ha(U1, U2)

D.3.1. PROOF OF PROPOSITION 4.4

With this formalism, we now prove that the ACMMD Rel test has the specified type-I error rate of α (0, 1), e.g. rejects H0 when P|x = Q|x with probability α. Indeed, straightforward adaptations of the arguments in Appendix C.2 show that that doing a wild bootstrap using the bh(Zi, Zj) is equivalent to estimating

1 i<j N ha(U σ i , U σ j ) := \ ACMMD Rel 2 σb

where U σ i := (Q|Xi, Y 1 i , . . . , Y R i , Y σ i , Y σ i ), where (Y σ i , Y σ i ) = (Yi, Yi) if σ(i) = 1 and ( Yi, Yi) otherwise. The same argument to show that ACMMD test has the desired type-I error rate follows in this case too: Under P|x = Q|x, the sequence

{ \ ACMMD Rel 2 σb}B b=0 is exchangeable (noting \ ACMMD Rel 2 σ0 = \ ACMMD Rel 2, e.g. σ0(i) = 1 for all 1 i N), and we can repeat the derivations of the proof of Lemma C.2 to show that the ACMMD Rel test has the desired type-I error rate.

D.3.2. PROOF OF PROPOSITION 4.3

We prove a slightly more general version of the proposition, for kernels of the form ϕ(d(q, q )2), where ϕ is a Lipschitz function and d is a distance on P(Y). Setting ϕ = e σ2 , we recover the kernels of Proposition 4.3, which include the exponentiated MMD kernel.

Proposition D.1. Assume that k Y and k P(Y) is a kernel of the form k P(Y)(q, q ) = ϕ(d(q, q )), for a Lipschitz function ϕ and a function d(q, q ) admitting an unbiased estimator of the form bd({yr 1}R r=1, {yr 2}R r=1) where {yr 1}r r=1 and {yr 2}r r=1 are R i.i.d samples of q and q respectively, with variance O( 1

R) (the bound in uniform in q and q ). Then, assuming R R(N), with lim N R(N) = + , \ ACMMD Rel 2 converges in probability to ACMMD Rel 2 as N .

Kernel-Based Evaluation of Conditional Biological Sequence Models

Proof. As discussed above, the estimator \ ACMMD Rel 2 can be written as a U-statistics on the {Ui}N i=1, where Ui = (Q|Xi, Y 1 i , . . . , Y R i , Yi, Yi), and using the kernel ha defined as (accounting approximating k Y through d directly)

ha(Ui, Uj) := ϕ(bd({ Y r i }R r=1, { Y r j }R r=1)) (k Y(Yi, Yj) + k Y( Yi, Yj) k Y(Yi, Yj) k Y( Yi, Yj))

\ ACMMD Rel 2 = 2 N(N 1)

i<j ha(Ui, Uj)

To study the convergence in probability of \ ACMMD2 a to ACMMD2 a, we use finite-sample bounds on U-statistcs Hoeffding (1994):

| \ ACMMD2 a ACMMD 2 a| > ha

for all δ > 0, where, by assumption, k Y bounded, and bk P(Y) is of the form ϕ(bd({yr 1}R r=1, {yr 2}R r=1)) for some bounded function ϕ, implying that ha is bounded. To show the dependence in R, we bound the difference ACMMDa and ACMMD.

|ACMMD 2 a ACMMD 2| = EU,U h (bk P(Y)({Y r 1 }R r=1, {Y r 2 }R r=1) k P(Y)(Q|X1, QX2)) (k Y(Y1, Y2) + k Y( Y1, Y2)

k Y(Y1, Y2) k Y( Y1, Y2)) i

4 k Y |EPQ| Q r PQ| Q rbk P(Y)({Y r 1 }R r=1, {Y r 2 }R r=1) k P(Y)(Q|X1, QX2)|

4 k Y ϕ Lip EPQ| Q r PQ| Q r|bd({Y r 1 }R r=1, {Y r 2 }R r=1) d(Q|X1, QX2)|

4 k Y ϕ Lip

h EQ r Q r|bd({Y r 1 }R r=1, {Y r 2 }R r=1) d(Q|X1, QX2)| Q|X1, QX2 i

4 k Y ϕ Lip EPQ| PQ|

VQ r Q r bd({Y r 1 }R r=1, {Y r 2 }R r=1) Q|X1, QX2

Where the last inequality follows from Jensen s inequality and the unbiasedness of bd. The result follows by applying the assumption on the variance of bd (a bound which we assume is uniform in Q|X)1, Q|X2).

The term VQ r Q r h bd({Y r 1 }R r=1, {Y r 2 }R r=1)|Q|X1, Q|X2 i can be more precisely characterized depending on bd. For

instance, we have, when bd is a U-statistics (for instance, using the MMD estimator of Gretton et al. (2012, Lemma 6, Equation 4)) , that (Serfling, 2009, section 5.2.1) VQ r Q r bd({Y r 1 }R r=1, {Y r 2 }R r=1) < ζ(Q1, Q|X2)/R, where ζ(Q|X1, Q|X2) := V(Y1, Y1),(Y2, Y2) Q|X1 Q|X2( h((Y1, Y1), (Y2, Y2)) and h((Y1, Y1), (Y2, Y2)) = k Y(Y1, Y2) + k Y( Y1, Y2) k Y(Y1, Y2)

k Y( Y1, Y2), which is uniformly bounded by 4 k Y for bounded kernels. Putting the two parts together, we thus have that:

n \ ACMMD2 a ACMMD2o > 4 k Y

2 N/2 + 16 k Y 2 ϕ Lip

for all δ > 0, showing the convergence in probability of \ ACMMDa to ACMMD.

D.4. Additional Details for ACMMD and ACMMD Rel in the synthetic example

D.4.1. DERIVATIONS OF ACMMD IN THE SYNTHETIC EXAMPLE

We first prove that ACMMD is proportional to p.

Lemma D.2. In the setting described in Section 6.1, we have

ACMMD2(P|, Q|) = C p2

Kernel-Based Evaluation of Conditional Biological Sequence Models

C := ZZ k X (p, p )2(1 e λ) (1 2p)(1 2p )

1 4pp (1 + e λ)/2

1 2p e λ + 2pe λ

1 2pe λ + 1 d PX(p)d PX(p )

Proof. Recall that we have

ACMMD2 = MMD2(PX P|, PX Q|)2 = Z k X (p, p ) (T11 + T22 2T12) d PX(p)d PX(p )

T12 = Z k Y(y, y )p(y|p)q(y |p )d(y)d(y )

and T22 and T12 are defined similarly. For a sequence y, we define the the function len given by len(y) := min {i N|yi = STOP}, which intuitively returns the length of the sequence.

Computing Tij As we will see, a lot of the computations are agnostic to whether we are computing T11, T22 or T12. Note that the exponentiated hamming distance kernel on Y writes as a product

k Y(y, y ) = e λd H(y,y ) = e λy P i=0 δ(yi =y i) =

i=0 e λδ(yi =y i) =

max(len(y),len(y )) Y

i=0 e λδ(yi =y i)

let us define the following events

F(m) := n min(len(y), len(y )) = m o

G(m, δm) := n max(len(y), len(y )) = m + δm o

which we further break down as

F1(m) = {len(y) = m} {len(y ) > m}

F2(m) = {len(y) > m} {len(y ) = m}

F3(m) = {len(y) = m} {len(y ) = m}

= F(m) = F1(m) F2(m) F3(m)

For which the following probabilities hold:

P(F1(m)) = P(len(y) = m) P(len(y ) > m) = ((2p)m (1 2p)) (2p )m+1

P(F2(m)) = P(len(y ) = m) P(len(y) > m) = ((2p )m (1 2p )) (2p)m+1

P(F3(m)) = P(len(y ) = m) P(len(y) = m) = ((2p )m (1 2p )) ((2p)m (1 2p))

P(G(m, δm)|F1(m)) = P(len(y ) = m + δm|len(y) = m, len(y ) > m) = (2p )δm 1 (1 2p )δ(δm 1) P(G(m, δm)|F2(m)) = P(len(y) = m + δm|len(y ) = m, len(y) > m) = (2p)δm 1 (1 2p)δ(δm 1) P(G(m, δm)|F3(m)) = δ(δm = 0)

Let us note E(m, δm, i) := Fi(m) G(m, δm)

We have that E(m, δm, i) E(m , δm , j) = if (m, δm, i) = (m , δm , j).

δm=0 E(m, δm, i)

Kernel-Based Evaluation of Conditional Biological Sequence Models

Using the law of total probability, we have that Thus, using the law of total probability:

Tij(p, p ) =

δm=0 P(E(m, δm, i))E(e λd H(y,y )|E(m, δm, i))

δm=0 P(Fi(m) G(m, δm))E(e λd H(y,y )|E(m, δm, i))

i=1 P(Fi(m))

δm=0 P(G(m, δm)|Fi(m))E(e λd H(y,y )|E(m, δm, i))

i=1 P(Fi(m))E(e λd H(y:m,y :m)|Fi(m))

δm=0 P(G(m, δm)|Fi(m))E(e λd H(ym+1:m+max(δm,1),y m+1:m+max(δm,1))|E(m, δm, i), p, p )

i=1 P(Fi(m))E(e λd H(y:m,y :m)|Fi(m))

δm=0 P(G(m, δm)|Fi(m))e λ(max(0,δm 1)+δ(m>0))

i=1 P(Fi(m))E(e λd H(y:m,y :m)|Fi(m))

max(m 1,1) Y

i=1 E(e λδ(yi =y i)|Fi(m)))

δm=0 P(G(m, δm)|Fi(m))e λ(max(0,δm 1)+δ(m>0))

where we break down the factorized hamming distance over the sequence into the sum of the hamming distances over each coordinate, and made use of the fact that

d H(ym:m+δm, y m:m+δm) = max(0, δm 1) + δ(m > 0)

conditioned on Fi(m) and G(m, δm). The disjunction of cases is necessary in order to not count the term 0th term twice in the event when m = 0. This representation is convenient since whenever m 2, for any 1 i m 1,

P(δ(yi, y i) = 1|Fi(m)) = (pp ) + (pp ) (p + p) (p + p ) = 1

2 = P(δ(yi, y i) = 0|Fi(m))

meaning we have

Tij(p, p ) =

i=1 E(e λδ(y0 =y 0)|Fi(m))P(Fi(m)) 1 + e λ

δm=0 P(G(m, δm)|Fi(m))e λ(max(0,δm 1)+δ(m>0))

Kernel-Based Evaluation of Conditional Biological Sequence Models

Inserting the relevant event probabilities into the expression for Tij, we have

Tij(p, p ) =

E(e λδ(y0 =y 0)|F1(m))(2p)m (1 2p)(2p )m+1 (1 2p )e λδ(m>0) + X

δm=1 e λ(δm 1)(2p )δm 1

+ E(e λδ(y0 =y 0)|F2(m))(2p )m (1 2p )(2p)m+1 (1 2p)e λδ(m>0) + X

δm=1 e λ(δm 1)(2p)δm 1

+ E(e λδ(y0 =y 0)|F3(m))(2p )m (1 2p )(2p)m(1 2p)

E(e λδ(y0 =y 0)|F1(m))(2p)m (1 2p)(2p )m+1 (1 2p )e λδ(m>0) + X

δm=0 e λδm(2p )δm

+ E(e λδ(y0 =y 0)|F2(m))(2p )m (1 2p )(2p)m+1 (1 2p)e λδ(m>0) + X

δm=0 e λδm(2p)δm

+ E(e λδ(y0 =y 0)|F3(m))(2p )m (1 2p )(2p)m(1 2p)

E(e λδ(y0 =y 0)|F1(m))(2p)m (1 2p)(2p )m+1 (1 2p ) e λδ(m>0)

+ E(e λδ(y0 =y 0)|F2(m))(2p )m (1 2p )(2p)m+1 (1 2p) e λδ(m>0)

+ E(e λδ(y0 =y 0)|F3(m))(2p )m (1 2p )(2p)m(1 2p)

Now, some simplification arise when m 1. Indeed, in that case, E(e λδ(y0,y 0)|Fi(m)) is independent of i. Noting T 1 ij(p, p ) the sum of the terms for m 1, we thus have

T 1 ij(p, p ) =

m=1 E(e λδ(y0 =y 0)|F(m)) 1 + e λ

(2p)m (1 2p)(2p )m+1 (1 2p ) e λ

+ (2p )m (1 2p )(2p)m+1 (1 2p) e λ

+ (2p )m (1 2p )(2p)m(1 2p)

Kernel-Based Evaluation of Conditional Biological Sequence Models

Noting Aij the term E(e λδ(y0 =y 0)|F(m)), which is constant for all m 1

T 1 ij(p, p ) = Aij

(2p)m (1 2p)(2p )m+1 (1 2p ) e λ

+ (2p )m (1 2p )(2p)m+1 (1 2p) e λ

+ (2p )m (1 2p )(2p)m(1 2p)

= Aij(1 2p)(1 2p )4pp ( 2p e λ

1 2p e λ + 2pe λ

1 2pe λ + 1)

m=0 (4pp (1 + e λ)/2)m

= Aij (1 2p)(1 2p )4pp

1 4pp (1 + e λ)/2

1 2p e λ + 2pe λ

1 2pe λ + 1

C(p, p ) = (1 2p)(1 2p )4pp

1 4pp (1 + e λ)/2

1 2p e λ + 2pe λ

1 2pe λ + 1

is a constant that does not depend on i, j. We compute the m = 0 sum, noted T 0 ij(p, p ). We have

T 0 ij(p, p ) = E(e λδ(y0 =y 0)|F1(0)) (1 2p)(2p ) (1 2p ) 1 1 2p e λ

+ E(e λδ(y0 =y 0)|F2(0)) (1 2p )(2p) (1 2p) 1 1 2pe λ

+ E(e λδ(y0 =y 0)|F3(0)) (1 2p )(1 2p)

And we need to compute the terms E(e λδ(y0 =y 0)|Fi(0)) indivdually.

i=1, i=2 For i = 1, we must have y0 = y 0, since y0 = STOP, and len(y ) > 0. Thus, E(e λδ(y0 =y 0)|F1(0)) = e λ. Similarly, E(e λδ(y0 =y 0)|F2(0)) = e λ.

i=3 In that case, we must have y0 = y 0 = STOP, since len(y) = len(y ) = 0. Thus, E(e λδ(y0 =y 0)|F3(0)) = 1.

Putting this together, we have

T 0 ij(p, p ) = (1 2p)(1 2p ) 2p e λ

1 2p e λ + 2pe λ

1 2pe λ + 1

With that notation, we have:

ACMMD2(P|, Q|) = Z k X (p, p )C(p, p )(A11 + A22 2A12)d PX(p)d PX(p )

+ Z k X (p, p )(T 0 11(p, p ) + T 0 22(p, p ) 2T 0 12(p, p ))d PX(p)d PX(p )

= Z k X (p, p )C(p, p )(A11 + A22 2A12)d PX(p)d PX(p )

since T 0 ij does not depend on i, j. We can narrow the variation down even further: by noting p A ij = P(δ(yi = y i) = 0|F(m)) (resp p B ij = P(δ(yi = y i) = 0|F(0))), since E(e λϵ) = p(ϵ = 0)(1 e λ) + e λ if ϵ is a Bernoulli random variable,

ACMMD2(P|, Q|) = Z k X (p, p )C(p, p )(1 e λ)(p A 11 + p A 22 2p A 12)d PX(p)d PX(p )

Kernel-Based Evaluation of Conditional Biological Sequence Models

We now compute the probabilities p A ij for i, j {1, 2}. In every case, such p A ij can be written as:

p A ij = P(y0 = y 0 = A) + P(y0 = y 0 = B) P({y0 {A, B}} {y 0 {A, B}}) P(y0 = y 0 = A) + P(y0 = y 0 = B) 4pp

and we have

p A 11 = pp + pp

p A 22 = (p + p)(p + p) + (p p)(p p)

4pp = 2pp + 2 p2

p A 12 = (p)(p + p) + (p)(p p)

= p A 11 + p A 22 2p A 12 = 2pp + 2 p2

Putting it together We thus have

ACMMD(P|, Q|) = Z C(p, p )k(p, p )(1 e λ)2 p2

4pp d PX(p)d PX(p )

Recalling that

C(p, p ) = (1 2p)(1 2p )4pp

1 4pp (1 + e λ)/2

1 2p e λ + 2pe λ

1 2pe λ + 1

yields the desired result.

D.4.2. CLOSED-FORM ACMMD Rel EVALUATION

Assuming the same model, it is also possible to evalute ACMMD Rel(P|, Q|) in closed form. Indeed, ACMMD Rel becomes a special case of the ACMMD formula given above, with the conditionned variable X set to be the models Q|X. It is thus possible to show:

Lemma D.3. We have ACMMD Rel2(P|, Q|) = C p2

C = ZZ k P(Y)(q|p, q|p )2(1 e λ) (1 2p)(1 2p )

1 4pp (1 + e λ)/2

1 2p e λ + 2pe λ

1 2pe λ + 1 d PX(p)d PX(p )

The above lemma leaves the choice of the kernel k P(Y) open: the tractability of this expression will follow only if such kernel can be tractably computed. In the next lemma, we derive a closed form solution for k P(Y)(q, q ) when

k P(Y)(q, q ) = e MMD2(q,q )

2σ2 , where the MMD is computed with an Exponentiated Hamming kernel on Y.

Lemma D.4. We have MMD2(q|p, q|p ) = T(p, p) + T(p , p ) 2T(p, p )

Where T(p, p ) = C(p, p )A(p, p ) + T 0(p, p )

C(p, p ) = (1 2p)(1 2p )4pp

1 4pp (1 + e λ)/2

1 2p e λ + 2pe λ

1 2pe λ + 1

A(p, p ) = 2pp + 2 p2

4pp (1 e λ) + e λ

T 0(p, p ) = (1 2p)(1 2p ) 2p e λ

1 2p e λ + 2pe λ

1 2pe λ + 1

Combining the two lemmas allows us to obtain a computable expression for ACMMD Rel(P|, Q|).

Kernel-Based Evaluation of Conditional Biological Sequence Models

0 500 1000 Num. Samples

1.0 ACMMD Rel Test: Rejection Rate

α : 0.05 p : 0.0

p : 0.05 p : 0.07

p : 0.1 p : 0.2

0 500 1000 Num. Samples

ACMMD Test: Rejection Rate

α : 0.05 p : 0.0

p : 0.05 p : 0.07

p : 0.1 p : 0.2

0 500 1000 Num. Samples

ACMMD Estimates

ACMMD2 \ ACMMD2

0 500 1000 Num. Samples

ACMMD Rel Estimates

ACMMD2 Rel \ ACMMD2 Rel

Figure 6. Top left: Rejection Rate of the ACMMD test as a function of the dataset size, for different values of p. Top right: Rejection Rate of the ACMMD test as a function of p, for different dataset sizes. Bottom left: Estimated ACMMD as a function of the dataset size. Bottom right: Estimated ACMMD Rel as a function of p. To compute these estimates, we use dataset sizes of {10, 100, 200, 500, 1000}, used m = 5 atoms for the prior on p between p1 = 0.3, p2 = 0.45, used λ = 1, p = 0.25, and average over 300 runs. In addition, we plot the true value ACMMD(P|, Q|) using the closed-form expression derived above.

Kernel-Based Evaluation of Conditional Biological Sequence Models

E. Additional Experiments

E.1. Additional Experiments for the semi-synthetic Protein MPNN data

In addition to the figures of Section 6.2.1, which use T = 0.1 to plot the estimates and rejection rates of ACMMD and ACMMD Rel on the Protein MPNN synthetic data, we provide here the same plots for T = 1.0 the value used to train Protein MPNN. We notice that detecting a given change in temperature is sligtly simpler for T = 1.0 than for T = 0.1.

0 1000 2000 3000 Num. Samples

0 1000 2000 3000 Num. Samples

Rejection rate

0.0 0.01 0.05 0.1

250 500 750 1000 Num. Samples

\ ACMMD Rel2

250 500 750 1000 Num. Samples

Rejection rate

0.0 0.01 0.05 0.1

250 500 750 1000 Num. Samples

250 500 750 1000 Num. Samples

Rejection rate

0.0 0.01 0.05 0.1

250 500 750 1000 Num. Samples

\ ACMMD Rel2

250 500 750 1000 Num. Samples

Rejection rate

0.0 0.01 0.05 0.1

Figure 7. ACMMD and ACMMD Rel estimates and rejection rate in the semi-synthetic setting of Section 6.2.1. The different lines indicate a different temperature shift between the two MPNN models. Top panel shows uses a base temperature of T = 1.0, while the bottom panel uses T = 0.1.

E.2. Additional Experiments for the structural superfamily evalution

We include in Figure 8 the values of \ ACMMD Rel 2 for different superfamilies (which was not included in Section 6.2.2), and compare it with the values of \ ACMMD2. In line with the hyperparameter tuning results of Section 6.2.2, we notice that high temperature are highly detrimental from a reliability perspective. Intuitively, increasing the temperature of Protein MPNN makes the model underconfident . Since a reliable model is neither over nor underconfident, this decrease of confidence is penalized by ACMMD Rel. This also shows that increasing the temperature of a model does not make the model fallbak to its prior distribution (otherwise the model would be more reliable). Instead, it just increases the uncertainty of the model in a detrimental fashion.

F. Known Kernels for protein sequences and structures

In the context of inverse folding, computing the ACMMD requires a kernel on sequences k Y and a kernel on protein structures. This section contains a brief overview of non neural-network based, known kernels for protein sequences and structures. The main desiderata to achieve when choosing kernels for computing goodness-of-fit criterion is to find kernels that are able to detect (up to statistical noise) any deviation from a perfect fit between the model and the data. Such kernels are referred to as universal.

Kernel-Based Evaluation of Conditional Biological Sequence Models

Homeodomain-like

PI 3-kinase

Acid Proteases

SH3 domains

WHL DNA binding

Immunoglobulins

Superfamily

Homeodomain-like

PI 3-kinase

SH3 domains

WHL DNA binding

Acid Proteases

Immunoglobulins

Superfamily

\ ACMMD Rel2

Figure 8. Value of \ ACMMD2 (left) and \ ACMMD Rel 2 (right) between Protein MPNN and the CATH S60 reference dataset on a subset of 10 superfamilies for two different temperatures T = 1.0 and T = 0.1.

The protein formalism The most general formalism for the space protein structures is the set of equivalence classes of graphs where the equivalence relationship is defined to the existence of a graph isomorphism. The need for equivalence classes stems from the fact that different labelling policies exist for a given protein, meaning that a single protein can be associated to multiple graphs. However, this labelling will in practice not be completely arbitrary: first, the set of candidate labelling can be restricted to the ones consistent with covalent bounds. But in the inverse folding problem, the setting is even simpler: the protein structure is restricted to its backbone, which is sequential by nature. This limits the set of covalent-bound consistent labelling policies to two (the forward and the backward one), and my vague understanding is that there is a terminal atom in protein, which suggests the existence of a canonical direction: thus, only one labelling policy remain, and protein structures can thus be associated to the set of atom locations S+ i=1 Ri. This set differs from the set of protein sequences S+ i=1 A in that the alphabet is the real line instead of a finite set of symbols. The restriction from the space of graphs to the space of variable-length sequences since there it is known that no graph kernels commonly in use are even characteristic (Kriege et al., 2020). The space S+ i=1 X (for arbitrary X have been investigated by the time series community), which have developed a set of kernels to carry out data analysis on it. We provide some background on such kernels below.

Background: alignment kernels for real-valued sequences of arbitrary length Alignment kernels (Cuturi et al., 2007; Cuturi & Blondel, 2017; Saigo et al., 2004; Vert et al., 2004) refer to a diverse set of variety of kernels constitute a family of kernels on S+ i=1 X i that are computed based on aggregating the similarities between all possible alignment candidates between two inputs x1 and x2. There are two main subfamilies of alignment kernels, which both use slightly different alignment definitions: local alignment kernels, and global alignment kernels.

Local alignment kernels Local alignment kernels (Saigo et al., 2004; Vert et al., 2004) are kernels of the form

k LA(x, y) = X

π Π(x,y) exp(βs(x, y, π)) (15)

s(x, y, π) =

i=1 s(x(π1(i)) 1 , x(π2(i)) 2 ) +

i=1 g(π1(i + 1) π1(i)) + g(π2(i + 1) π2(i))

and Π(x, y) is the set of all possible alignments of x and y, e.g. the set of all 2-tuple of p-long sequences

π := ((π1(1), . . . , π1(p)), (π2(1), . . . , π2(p))

where 1 π1(1) < π1(2) < π2(p) n

1 π2(1) < π1(2) < π2(p) m

Kernel-Based Evaluation of Conditional Biological Sequence Models

Importantly, local alignment kernels involve a gap function, and thus give a specific status to insertions and deletions, unlike global alignment kernels, as we will see below. The local alignment kernel can be seen as computing the (soft) minimum of a discrepancy within the set of all possible alignments. The use of a soft minimum (and not a hard one) is crucial to ensure positive definiteness. Local alignment kernels seem to have been designed initially for finite alphabets target. When g = 0, the necessary and sufficient condition on s to ensure that k LA is a positive definite is for s to be a conditionally positive definite kernel 4. This is in particular verified if (s(xi, yi))1 i,j |A| is positive definite. I need further reading to investigate whether the case of infinite X was studied.

Global alignment kernels Global alignment kernels (Cuturi et al., 2007; Cuturi & Blondel, 2017) also perform a softmin over alignment, but do not incorporate gaps in their score, and use a slightly different notion of alignment, namely:

π := ((π1(1), π2(1), . . . , (π1(p), π2(p))

where now, the constraints on π1 and π2 are

1 = π1(1) < π1(2) < π2(p) = n

1 = π2(1) < π1(2) < π2(p) = m

π1(i + 1) π1(i) + 1 unitary increments

(π1(i + 1) π1(i)) + (π2(i + 1) π2(i)) 1 no repetitions

Unlike the previous alignment definition, this one explicitly maps each item in each sequence with another item in the other sequence, and does not try to account for potential gaps. Let us call A the set of all alignment. The final definition for a global alignment kernel is then:

k GA(x, y) = X

π A(x,y) exp(

i=1 s(x(π1(i)) 1 , x(π2(i)) 2 ) (16)

As stated in Cuturi et al. (2007, Theorem 1), k GA will be positive definite if k(x, y) := exp(s(x, y)) is a positive definite kernel such that k (1 k) is positive definite.

4A kernel is c.p.d if Pn i,j=1 cicjs(x(i), x(j)) 0 c1, . . . , cn, c1 + + cn = 0.