# Biological Sequence Kernels with Guaranteed Flexibility

Journal of Machine Learning Research 26 (2025) 1-63. Submitted 4/23; Revised 7/25; Published 10/25.

Alan N. Amin (alanamin@nyu.edu), New York University
Debora S. Marks (debbie@hms.harvard.edu), Harvard Medical School and Broad Institute of Harvard and MIT
Eli N. Weinstein (enawe@dtu.dk), Technical University of Denmark

∗. Equal contribution.

Editor: Jean-Philippe Vert

©2025 Alan N. Amin, Debora S. Marks, and Eli N. Weinstein. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-0455.html.

Abstract

Applying machine learning to biological sequences (DNA, RNA and protein) has enormous potential to advance human health and environmental sustainability. To support such high-stakes applications, it is important to develop models and evaluations that not only capture underlying biology, but also have theoretical guarantees of reliability and performance. In this article, we analyze kernel methods for biological sequences, including both hand-crafted kernels and deep neural network-based kernels. We show that popular biological kernels can severely fail at learning functions or distinguishing distributions. We then develop modified kernels that (1) are universal, characteristic, and metrize the space of distributions, and (2) preserve the underlying biological inductive biases and domain knowledge embedded in the original kernel. Our results rest on novel proof techniques for kernels that handle the structure of biological sequence space (discrete, variable-length sequences) and biological notions of sequence similarity. We illustrate our theoretical results in simulation and on real biological data sets.

Keywords: kernel methods, sequences, biology, nonparametric statistics, representations

1. Introduction

DNA, RNA and protein sequence data have become a rich area of scientific application for modern machine learning, with tools emerging to diagnose genetic diseases, engineer proteins, and design therapeutics (Frazer et al., 2021; Yang et al., 2019; Notin et al., 2024). As high-stakes real-world applications proliferate, it becomes increasingly important to develop robust evaluations, reliable procedures, and accurate theoretical understanding. One promising route to advancing these goals is through kernel methods. In principle, kernel methods can simultaneously offer: (1) rich inductive biases, allowing scientists to build in prior domain knowledge, (2) flexibility, ensuring models can capture arbitrary complexity in the data, and (3) theoretical tractability, enabling rigorous guarantees and understanding of learning procedures.

Computational biologists have developed kernels with biological inductive biases, which capture our basic understanding of sequence evolution and molecular function (Schweikert et al., 2008; Toussaint et al., 2010; Leslie et al., 2002; Yang et al., 2018; Alley et al., 2019). The problem, as we will show, is that these kernels are inflexible. For example, they are not universal, so they cannot describe arbitrary functions. As a consequence, methods that use these kernels do not have strong theoretical guarantees on their success. This article proposes modifications to popular biological sequence kernels that provide theoretical guarantees on flexibility, while preserving their biological inductive biases. Specifically, we construct kernels that are:
1. Universal. This ensures that kernel regression methods can capture any function on sequences. So, for instance, biologists can learn any sequence-to-phenotype relationship.

2. Characteristic. This ensures that the maximum mean discrepancy (MMD) between two distributions is zero if and only if the two distributions match exactly (Gretton et al., 2012). So, for instance, biologists can detect any mismatch between a deep generative model and real biological sequences.

3. Metrize the space of distributions. This ensures that minimizing the MMD to a target distribution actually yields distributions converging to that target. So, for instance, biologists can train simulators to match real biological sequence data by minimizing the MMD (Li et al., 2017; Park et al., 2016; Dellaporta et al., 2022).

These properties appear widely as necessary conditions for kernel-based machine learning methods, across tasks from representation learning to causal inference (Pogodin et al., 2022; Singh et al., 2023).

Figure 1 provides a toy illustration of the impact of our theory on a regression problem (details in Appendix A). We compare a standard kernel (black) to one of our modified kernels (blue). Both kernels judge sequence dissimilarity based on the Hamming distance, a common measure of mutational distance in biology. But the kernels differ in their functional form. As a result, the standard kernel fails to learn the true function, while our new universal kernel succeeds.

[Figure 1: Least squares regression with Hamming and IMQ-H kernels. N is the size of the data set. Performance is measured by root mean squared error, normalized by the standard deviation of the outcome variable. Details in Appendix A.]

Studying kernels on biological sequence space presents unique theoretical challenges. Biological sequences are strings of characters (nucleotides or amino acids), and typically vary in length across a data set. Thus, following previous theoretical work on biological sequences, we take sequence space to be the set of all finite-length strings $\bigcup_{L=0}^{\infty} B^L$ of an alphabet $B$, where e.g. $B = \{A, T, G, C\}$ for DNA (Amin et al., 2021; Weinstein, 2022). Biological sequence space is therefore discrete and infinite, while the vast majority of previous theoretical studies have been done on continuous or finite spaces.

We present the first theoretical results on the flexibility of kernels for biological sequences. Our first step is to show that kernels on biological sequences have guaranteed flexibility if they have discrete masses, i.e. if the associated Hilbert space contains delta functions (Section 4) (Jorgensen and Tian, 2015). Then we develop a suite of tools to prove that kernels have discrete masses (Section 5). We apply these methods to study four key classes of biological sequence kernels that are in practical use by computational biologists. For each, we diagnose problems with current kernels, and develop modifications that grant flexibility while preserving biological inductive biases.

Position-wise comparison kernels (Section 6). These kernels compare sequences position-by-position, e.g. kernels based on the Hamming distance.

Alignment kernels (Section 7). These kernels compare sequences with a pairwise alignment, accounting for insertion and deletion mutations.

Kmer spectrum kernels (Section 8). These kernels compare sequences based on their sub-strings (kmers), accounting for similarities in the set of motifs in each sequence.

Embedding kernels (Section 9).
These kernels compare sequences based on an embedding into Euclidean space. The embedding can come from a neural network, e.g. a representation from a foundation model.

Table 1 gives a general overview. We illustrate our results on both synthetic and real data in Section 10, showing they result in improved empirical performance.

Table 1: Overview of standard kernels and proposed modifications. Each section of the paper studies a different family of biological sequence kernels. In this table, we list an especially well-known instance of each family, and an example of an alternative we propose, which has guaranteed flexibility and similar biological inductive biases. We report the computational complexity of standard algorithms for evaluating each example on two sequences of length L. Here, k is the k-mer length and d is the embedding dimension (we ignore the fixed cost of embedding each sequence and assume an RBF kernel in the embedding space).

| Kernel | Has discrete masses | Compute | Section |
|---|---|---|---|
| Hamming | No | O(L) | Sec. 6 |
| IMQ-H | Yes | O(L) | |
| Alignment | Sometimes | O(L^2) | Sec. 7 |
| Heavy-tailed alignment | Yes | O(L^3) | |
| Kmer spectrum | No | O(kL^2) | Sec. 8 |
| Infinite spectrum | Yes | O(L^2) | |
| Embedding | Usually no | O(d) | Sec. 9 |
| Scaled embedding | Usually yes | O(d) | |

2. Notation

In this section we establish notation for reference in the rest of the text. We let $S$ be a countably infinite set with the discrete topology. In Sections 4 and 5, where we study guarantees for kernels on arbitrary infinite discrete spaces, $S$ can be any countably infinite set; in the rest of the paper we will specifically be interested in the case when $S$ is the space of sequences (defined below). For any finite set $B$ we define $|B|$ as its cardinality. We let $N$ be the set of natural numbers $\{0, 1, 2, \ldots\}$. We also define $1(P)$ to be the indicator function, which is 1 if $P$ is true and 0 if it is false. If $y, z \in R$, we define $y \vee z$ as their maximum and $y \wedge z$ as their minimum. For $X \in S$, by $\delta_X$ we mean either the function or the measure that is 1 on $X$ and 0 everywhere else. We call $C_0(S)$ the space of functions on $S$ that vanish at infinity and $\|\cdot\|_\infty$ the infinity norm on $C_0(S)$. We define $C_C(S)$ to be the set of functions on $S$ that are non-zero at only finitely many points. We define $P(S)$ to be the space of probability distributions on $S$.

A kernel is a function $k : S \times S \to R$ such that for any finite set of $Y_1, \ldots, Y_N \in S$ and $\alpha_1, \ldots, \alpha_N \in R$, $\sum_{n,m} \alpha_n \alpha_m k(Y_n, Y_m) \geq 0$. We define the inner product $(\cdot|\cdot)_k$ on linear combinations of the functions $k_Y = k(Y, \cdot)$ so that $(\sum_{n=1}^N \alpha_n k_{X_n} \mid \sum_{m=1}^M \beta_m k_{Y_m})_k = \sum_{n,m} \alpha_n \beta_m k(X_n, Y_m)$ for $X_n, Y_m \in S$ and $\alpha_n, \beta_m \in R$. We denote the associated norm as $\|\cdot\|_k$. We say that a kernel is strictly positive definite if $\sum_{n=1}^N \alpha_n k_{X_n} \neq 0$ whenever the $X_n$ are distinct and the $\alpha_n$ are non-zero. We define the reproducing kernel Hilbert space (RKHS) of the kernel, $H_k$, as the Hilbert space completion using the inner product $(\cdot|\cdot)_k$. We can write every $f \in H_k$ as a function on $S$ given by $f(X) = (f|k_X)_k$. For a set of vectors $\{v_\lambda\}_\lambda \subset H_k$, we define $\mathrm{span}\{v_\lambda\}_\lambda$ as the set of finite linear combinations of elements of $\{v_\lambda\}_\lambda$. Say $\mu$ is a signed measure on $S$ such that $\int d|\mu|(X) \sqrt{k(X, X)} < \infty$. Then there is an $f \in H_k$ such that for all $g \in H_k$ we have $(f|g)_k = \int d\mu(X) g(X)$ and $\|f\|_k^2 = \int d\mu(X) \int d\mu(Y) k(X, Y)$. This element of $H_k$ is called the kernel embedding of the measure $\mu$, denoted $\int d\mu(X) k_X$.

Let $B$ be a finite set, the alphabet (for example, this would be the four nucleotides for DNA, or the twenty amino acids for proteins).
We define the set of sequences as $S = \bigcup_{L=0}^{\infty} B^L$, where $B^0$ is defined to be the set containing just the sequence of length zero, $\emptyset$. For a sequence $X$ and numbers $l, l'$, we define $X_{(:l)}$ as the first $l$ letters in $X$, $X_{(l:l')}$ as the $l' - l$ letters after the first $l$ letters, and $X_{(l:)}$ as all letters after the first $l$ letters, which is potentially the empty sequence $\emptyset$. We call $X_{(l)}$ the $l$-th letter of $X$, starting counting at 0. The concatenation of sequences $X, Y$ is denoted $X + Y$. For any number $l$, the sequence $X$ concatenated to itself $l$ times is denoted $l \cdot X$. We call $d_H(X, Y)$ the Hamming distance between the sequences $X$ and $Y$, that is, the total number of mismatches between the two sequences after they have been padded with an infinite tail of stop symbols $\$$.

3. Related work

There is a large body of theoretical work studying kernel flexibility, much of it focused on the properties of universality, characteristicness, and metrizing probability distributions (Sriperumbudur et al., 2009, 2010, 2011; Fukumizu et al., 2008; Christmann and Steinwart, 2010). These results have been foundational in providing strong guarantees on a wide variety of kernel methods, including not only regression methods but also two-sample tests (Gretton et al., 2012), independence tests (Gretton et al., 2007), sample quality evaluation (Gorham and Mackey, 2017), learning implicit models (Li et al., 2017), learning models with intractable normalizing constants (Dai et al., 2019; Matsubara et al., 2022), causal inference (Singh et al., 2021), Bayesian model selection and data selection (Weinstein and Miller, 2023), and much more. However, virtually all existing flexibility guarantees are for kernels on particular topological spaces. Proof methods apply Fourier transforms to their domains (Sriperumbudur et al., 2011) or require compact spaces (Christmann and Steinwart, 2010). Thus, when it comes to applying these powerful kernel methods to infinite discrete spaces in general, and biological sequences in particular, theoretical justification is lacking. The situation is particularly challenging for kernels that are used in practice, which must capture real-world structure (the underlying biology) and be tractable computationally. Indeed, it is not entirely obvious a priori that strong flexibility guarantees can even be established for practical biological sequence kernels: in the case of graphs, for instance, kernels that are characteristic are at least as hard to compute as the graph isomorphism problem, for which no polynomial-time algorithm is known (Kriege et al., 2020; Gärtner et al., 2003). In other words, tractability and flexibility are not necessarily compatible.

Jorgensen and Tian (2015) established pioneering results on kernel flexibility for infinite discrete spaces, exploring the idea of kernels with discrete masses. We extend the scope and implication of their results dramatically, by (a) connecting the discrete mass property to the notions of flexibility important for machine learning applications, and (b) studying kernels relevant to biological sequences. More broadly, we make the theory of Jorgensen and Tian (2015) more user-friendly, by establishing easier routes to proving the discrete mass property, and elaborating its practical implications.

Kernels for biological sequences are mathematically related to kernels for time series data.
Many of our constructions below start with kernels $k_s$ that compare individual letters of the alphabet; these can be swapped with arbitrary kernels to compare sequences of any type of object. As well, many popular kernels for biological sequences can be seen as discretizations of kernels for time series data (Király and Oberhauser, 2019). While the flexibility properties of continuous time series kernels are well established (Hambly and Lyons, 2010), those results unfortunately do not transfer to biological sequence kernels in a straightforward way.

4. The discrete mass property

In this section we introduce the discrete mass property and describe its implications for kernel flexibility. We focus on properties necessary for procedures useful in biological sequence analysis. In particular, we show that if a kernel has discrete masses, it is (1) universal (an important property for consistent regression), (2) characteristic (important for consistent two-sample testing (Gretton et al., 2012; Fukumizu et al., 2007)) and (3) metrizes the space of distributions (important for optimizing sequence distributions (Li et al., 2017; Pronzato and Zhigljavsky, 2020; Huszár and Duvenaud, 2012; Futami et al., 2019)). Later, we will design biological sequence kernels that have discrete masses to satisfy these three properties.

We say a kernel has discrete masses if its Hilbert space includes delta functions at all points in the space $S$. Note that in this section we take $S$ to be an arbitrary discrete space; later (starting in Section 6) we will specialize to the setting where $S$ is sequence space.

Definition 1 (The discrete mass property) We say $k$ has discrete masses if $C_C(S) \subset H_k$, where $C_C(S)$ is the set of all functions on $S$ that are non-zero at only finitely many points. Since $H_k$ is a linear space, this is equivalent to $\delta_X \in H_k$ for all $X \in S$.

For ease of exposition, in this section we will assume that $k(X, X) = 1$ for all $X \in S$.¹

4.1 Universal kernels and function approximation

If the RKHS is large enough to describe any function in some very general set, we say it is universal. In particular, we focus on $C_0$-universality, which says that the RKHS can approximate any function in $C_0$, the set of all functions on $S$ that vanish at infinity (Sriperumbudur et al., 2011).

Definition 2 ($C_0$-universality) Say $k_X \in C_0(S)$ for all $X \in S$. We say $k$ is $C_0$-universal if for every $f \in C_0(S)$ and $\epsilon > 0$ there is a $g \in H_k$ such that $\|f - g\|_\infty < \epsilon$.

Are common biological sequence kernels universal? Many are not, because they fail a simple criterion: they do not have infinitely many features (Leslie et al., 2002; Tsuda et al., 2002; Jaakkola et al., 2000).

Example 1 (The kmer spectrum kernel is not universal) The L-mer spectrum kernel uses as features the number of times each subsequence (kmer) up to length $L$ occurs in a sequence (Leslie et al., 2002). It is defined as $k(X, Y) = (\Phi(X)|\Phi(Y))$ where $\Phi(X)$ is a vector with entries $\Phi(X)_V$ specifying the number of times the kmer $V$ appears in $X$, for all $V \in S$ such that $|V| \leq L$. Since the total number of features is finite, it is not universal.

Although infinitely many features are necessary for a kernel to be universal, they are not sufficient: in Fig. 1 (details in App. A) we gave an example of a kernel which has infinitely many features but is not universal.

1. This does not result in a loss of generality, as we can replace, in the arguments below, the space of distributions $P(S)$ with $\{\mu \in P(S) \mid \int d\mu(X) \sqrt{k(X, X)} < \infty\}$ and replace the space of functions that vanish at infinity $C_0(S)$ with $\{X \mapsto f(X)\sqrt{k(X, X)} \mid f \in C_0(S)\}$.
Instead, to prove that our proposed new biological sequence kernels are universal, we will prove that they have discrete masses: since $C_C(S)$ is dense in $C_0(S)$, we get the following proposition.

Proposition 3 Kernels with discrete masses are $C_0$-universal.

4.2 Characteristic kernels for distribution comparison

Besides regression, kernels can also be used to compare distributions. We can measure the difference between two distributions $\mu$ and $\nu$ using the maximum mean discrepancy.

Definition 4 (Maximum mean discrepancy) Recall the embedding of a measure $\mu$, denoted $\int d\mu(X) k_X$, is the element of $H_k$ that satisfies $(\int d\mu(X) k_X \mid g)_k = \int d\mu(X) g(X)$ for all $g \in H_k$. The MMD is the norm of the difference in embeddings of two distributions $\mu, \nu$,

$$\mathrm{MMD}_k(\mu, \nu) = \max_{\{f \in H_k : \|f\|_k \leq 1\}} E_\mu f - E_\nu f = \left\| \int d\mu(X) k_X - \int d\nu(Y) k_Y \right\|_k, \qquad (1)$$

where $\|f\|_k^2 = (f|f)_k$.

In order for the MMD to be able to tell the difference between any two distributions, it must be characteristic.

Definition 5 (Characteristic kernel) We say a kernel $k$ is characteristic if $\mu \mapsto \int d\mu(X) k_X$ is injective on $P(S)$, the space of probability distributions on $S$.

Note that characteristicness is a strictly weaker condition than universality in general (Sriperumbudur et al., 2011).

The MMD is useful not only for comparing two given distributions $\mu$ and $\nu$, but also as an optimization objective that we can minimize to find a distribution $\nu$ that matches $\mu$. Ideally, a good optimization objective should not only tell us if we have reached the right answer, but also if we are headed in the right direction. For this to hold, the kernel must metrize the space of distributions $P(S)$.

Definition 6 (Metrizing $P(S)$) We say a kernel $k$ metrizes $P(S)$ if for every $\mu \in P(S)$ and sequence $\nu_1, \nu_2, \ldots \in P(S)$, it holds that $\mathrm{MMD}_k(\mu, \nu_n) \to 0$ implies $\nu_n \to \mu$.

Note, for kernels on continuous spaces, the analogous property is sometimes called metrizing weak convergence (Sriperumbudur et al., 2010). On $P(S)$, however, the weak topology is equivalent to the total variation topology.

Many popular biological sequence kernels are not characteristic and do not metrize $P(S)$. For example, kernels with finitely many features, in addition to not being universal, are also not characteristic. As with universality, having an infinite number of features is also not enough to make a kernel characteristic; one can easily show this for the Hamming kernel in Fig. 1 (details in App. A). Another class of kernels, built off of representation learning methods, is a popular choice for optimizing biological sequences, since low dimensional continuous optimization can be more tractable than high dimensional discrete optimization (Yang et al., 2018; Stanton et al., 2022; Notin et al., 2021). However, even very faithful representations, that preserve large amounts of information about sequence space, may not be able to metrize $P(S)$.

Example 2 (Embedding kernels are not guaranteed to metrize $P(S)$) Consider $S = \{\emptyset, A, AA, \ldots\}$. We define a target distribution $\mu = \delta_A$. To define a kernel, we consider the injective embedding $F(A) = 0$ and $F(n \cdot A) = 1/n$ for $n > 1$, and use a radial basis function kernel in the embedding space, such that the complete embedding kernel is $k(X, Y) = \exp(-(F(X) - F(Y))^2)$. Intuitively, since the radial basis function kernel is characteristic over $R$, and since the embedding $F$ is one-to-one, the embedding kernel is characteristic (we prove this rigorously in Section 9). However, this embedding kernel does not metrize $P(S)$. If we try to minimize $\mathrm{MMD}_k(\mu, \delta_X)$ with respect to $X$, we find that choosing longer and longer sequences brings the objective closer and closer to zero: $\mathrm{MMD}_k(\mu, \delta_{n \cdot A}) \to 0$ as $n \to \infty$.
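To make Example 2 concrete, the following is a minimal numerical sketch (an illustration, not one of the paper's experiments): it computes the MMD between the two point masses $\delta_A$ and $\delta_{n \cdot A}$ under the embedding kernel above, using the closed form $\mathrm{MMD}_k^2(\delta_X, \delta_Y) = k(X, X) + k(Y, Y) - 2k(X, Y)$. Sequences $n \cdot A$ are represented simply by the integer $n$; all names here are illustrative.

```python
import numpy as np

def F(n):
    """Embedding from Example 2: F(A) = 0, F(n*A) = 1/n for n > 1 (n*A is represented by n)."""
    return 0.0 if n == 1 else 1.0 / n

def k(m, n):
    """Embedding (RBF) kernel between m*A and n*A."""
    return np.exp(-(F(m) - F(n)) ** 2)

def mmd_point_masses(m, n):
    """MMD between two point masses: sqrt(k(X,X) + k(Y,Y) - 2 k(X,Y))."""
    return np.sqrt(k(m, m) + k(n, n) - 2 * k(m, n))

for n in [2, 5, 10, 100, 1000]:
    print(n, mmd_point_masses(1, n))  # tends to 0, yet n*A does not converge to A
```

The objective vanishes along a sequence of point masses that does not converge to the target, which is exactly the failure to metrize $P(S)$ described above.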
To design new biological sequence kernels that are guaranteed to be characteristic and to metrize $P(S)$, we again turn to discrete masses.

Proposition 7 (Kernels with discrete masses are characteristic and metrize $P(S)$) Say $k$ is a kernel such that $k(X, X) = 1$ and $k_X \in C_0(S)$ for all $X \in S$. Then

$$k \text{ has discrete masses} \implies \mathrm{MMD}_k \text{ metrizes } P(S) \implies k \text{ is characteristic}.$$

Proof The second implication is clear. Now say $k$ has discrete masses and $\mu, \nu_1, \nu_2, \ldots \in P(S)$ are such that $\mathrm{MMD}_k(\mu, \nu_n) \to 0$. For each $X \in S$, $\delta_X \in H_k$, so by Equation 1 we have $\|\delta_X\|_k^{-1} |\mu(X) - \nu_n(X)| \leq \mathrm{MMD}_k(\mu, \nu_n) \to 0$. Thus, $\nu_n(X) \to \mu(X)$ for every $X$, i.e. $\nu_n \to \mu$.

The discrete mass property thus guarantees that a kernel is universal, characteristic and metrizes $P(S)$, making the kernel a good choice for a wide range of applications.²

2. Note that the discrete mass property is stronger than all three of these properties, as there exist kernels that are universal and metrize $P(S)$ but do not have discrete masses (Appendix B).

4.3 Degenerate examples

The remainder of the paper will be concerned with designing biological sequence kernels with discrete masses. Before studying complicated kernels, however, we first consider two simple but degenerate kernels with discrete masses, and explain why they are unsatisfactory. One kernel with discrete masses is the identity kernel.

Example 3 (The identity kernel has discrete masses) Consider the kernel $k(X, Y) = \delta_X(Y)$ for all $X, Y \in S$. This kernel has discrete masses, since $k_X = \delta_X$ for all $X$. The problem is that it leads to poor generalization.

As this example demonstrates, the mere fact that a kernel has discrete masses does not mean it is a good modeling choice. Instead, we want kernels that not only have discrete masses but also capture biological notions of sequence similarity.

Another way to construct flexible kernels is to assume that $S$ is finite.

Proposition 8 Say $|S| < \infty$. If $k$ is strictly positive definite then it has discrete masses.

Proof For any $X, Y \in S$, $\delta_X(Y) = \sum_{X' \in S} \alpha_{X'} k(X', Y)$ for $\alpha = K^{-1} \delta_X$, where $K$ is the Gram matrix of all points in $S$.

Focusing on a finite $S$, while tempting from the perspective of theoretical convenience, can lead to unreliable methods in practice.

Example 4 (Sequence space with bounded length) If sequence space $S$ includes only sequences $X$ with length less than some maximum $L_{\max}$, that is $S = \bigcup_{L=0}^{L_{\max}} B^L$, and the kernel is strictly positive definite, then the kernel has discrete masses. The problem is that we want our kernel methods to be reliable even as we observe or optimize longer and longer sequences. As well, even if we choose $L_{\max}$ to be astronomical, so that Proposition 8 technically applies, studying the infinite sequence space $S = \bigcup_{L=0}^{\infty} B^L$ remains useful, in that it forces us to consider the behavior of our methods as sequences grow in length.

5. Characterizing and manipulating kernels with discrete masses

We have seen that kernels with discrete masses have strong guarantees on their ability to approximate functions and distinguish distributions. However, only a small number of kernels, with limited relevance for biological sequences, have previously been shown to have discrete masses on infinite data spaces (Jorgensen and Tian, 2015). In this section we develop theoretical tools that can be used to prove that kernels have discrete masses.
Section 5.1 describes conditions that are equivalent to having discrete masses; Section 5.2 describes transformations of kernels that preserve the discrete mass property. We will later apply these techniques to design new biological sequence kernels with discrete masses. Note that in this section $S$ is still an arbitrary infinite discrete space, not necessarily the space of sequences.

5.1 Equivalent formulations of the discrete mass property

In this section we describe two equivalent formulations of the discrete mass property.

5.1.1 Conditions on the span of kernel embeddings

The first formulation puts conditions on $\mathrm{span}\{k_Y\}_{Y \neq X}$ that guarantee $\delta_X \in H_k$. The intuition is as follows. Consider some element of the span, $\sum_{Y \in S} \alpha_Y k_Y \in \mathrm{span}\{k_Y\}_{Y \in S}$. If $\delta_X \in H_k$ then $(\delta_X \mid \sum_{Y \in S} \alpha_Y k_Y)_k = \sum_{Y \in S} \alpha_Y \delta_X(Y) = \alpha_X$; that is, we can think of $\delta_X$ as a function that takes every element of $\mathrm{span}\{k_Y\}_{Y \in S}$ to the coefficient in front of $k_X$. In order for this function to exist, it must be (1) well defined and (2) bounded. For the function to be well defined, $k_X$ must be linearly independent from $\{k_Y\}_{Y \neq X}$. For the function to be bounded, it must also be difficult to approximate $k_X$ using elements in $\mathrm{span}\{k_Y\}_{Y \neq X}$. This intuition can be formalized as follows.

Proposition 9 Let $X \in S$, and call $K_X$ the closure of $\mathrm{span}\{k_Y\}_{Y \neq X}$. Then $\delta_X \in H_k$ if and only if $H_k \neq K_X$, or in other words, $k_X \notin K_X$.

Proof Say $\delta_X \in H_k$. If $Y \neq X$, $(\delta_X|k_Y)_k = \delta_X(Y) = 0$. On the other hand, $(\delta_X|k_X)_k = 1$. Thus, $k_X \notin K_X$. On the other hand, say $k_X \notin K_X$. Then the orthogonal complement of $K_X$ is exactly one dimensional. Let $\varphi : H_k \to R$ be the linear function projecting $H_k$ onto the orthogonal complement of $K_X$, scaled so that $\varphi(k_X) = 1$. Then $\varphi(k_Y) = 0$ if $Y \neq X$, so $\varphi(k_Y) = \delta_X(Y)$ for all $Y \in S$. By the Riesz representation theorem, since $\varphi$ is continuous, there is an $f \in H_k$ such that for all $Y \in S$, $f(Y) = (f|k_Y)_k = \varphi(k_Y) = \delta_X(Y)$, so $f = \delta_X$.

One implication of this result is that continuous kernels on Euclidean space cannot have discrete masses.

Proposition 10 (Continuous kernels on Euclidean space do not have discrete masses) If the kernel $k : R^D \times R^D \to R$ is a continuous function, it does not have discrete masses.

Proof Consider a sequence $Y_1, Y_2, \ldots \in R^D$, with $Y_n \neq X$, that converges to $X \in R^D$. Then $\|k_{Y_n} - k_X\|_k^2 = k(X, X) + k(Y_n, Y_n) - 2 k(X, Y_n) \to 2 k(X, X) - 2 k(X, X) = 0$. In other words, $k_X$ is in the closure of $\mathrm{span}\{k_Y\}_{Y \neq X}$.

The theory of kernels with discrete masses is thus unique to infinite discrete spaces such as biological sequence space.

5.1.2 Conditions on the kernel's Gram matrix

A second formulation of discrete masses comes from pioneering work by Jorgensen and Tian (2015). We give a concise restatement of the proof of their result here. The basic idea builds off of Proposition 8, which says that on finite discrete spaces a strictly positive definite kernel has an invertible Gram matrix and thus discrete masses. On an infinite space, we consider the invertibility of a sequence of Gram matrices, defined on larger and larger finite subsets of $S$, to ensure the kernel has discrete masses.

Theorem 11 (Jorgensen and Tian, 2015) Let $X \in S$ and let $k$ be a strictly positive definite kernel on $S$. For a finite subset $B \subset S$ with $X \in B$, define the Gram matrix $K_B$ indexed by $B$ such that $K_{B,Y,Y'} = k(Y, Y')$ for $Y, Y' \in B$. Let $C_{B,X} = \sqrt{(K_B^{-1})_{X,X}}$. Then, $\delta_X \in H_k$ if and only if $\sup_B C_{B,X} < \infty$, where the supremum is over all finite $B \subset S$. In this case, $\|\delta_X\|_k = \sup_B C_{B,X}$.

Proof Define a functional $\varphi_X$ on $\mathrm{span}\{k_Y\}_{Y \in S}$ by $\alpha_X k_X + \sum_{Y \neq X} \alpha_Y k_Y \mapsto \alpha_X$.
Note $\varphi_X$ is well defined since $k$ is strictly positive definite. Now, if $\delta_X \in H_k$, then for all $f \in \mathrm{span}\{k_Y\}_{Y \in S}$ we have $(\delta_X|f)_k = \varphi_X(f)$, so that $\varphi_X$ is bounded. On the other hand, if $\varphi_X$ is bounded, it can be continuously extended to all of $H_k$ and must be equal to $(f|\cdot)_k$ for some $f \in H_k$ by the Riesz representation theorem. Then, $f(Y) = (k_Y|f)_k = \varphi_X(k_Y) = \delta_X(Y)$, so $f = \delta_X$. We will show that $C_{B,X}$ is the norm of $\varphi_X$ restricted to $\mathrm{span}\{k_Y\}_{Y \in B}$, and the result will follow. $\mathrm{span}\{k_Y\}_{Y \in B}$ is a finite dimensional space with $\{k_Y\}_{Y \in B}$ as a basis, and inner product $(v|w) = v^T K_B w$ for $v, w \in R^B$ in this basis. Calling $e_X$ the indicator vector for $X$, $\varphi_X(v) = (K_B^{-1} e_X \mid v)$. Finally, we see that the square norm of $\varphi_X$ is $(K_B^{-1} e_X \mid K_B^{-1} e_X) = e_X^T K_B^{-1} e_X = (K_B^{-1})_{X,X} = C_{B,X}^2$.

5.2 Transformations that preserve the discrete mass property

In this section we describe how to construct new kernels with discrete masses out of existing kernels with discrete masses. This allows us to construct large families of related kernels that all have flexibility guarantees. Many of the transformations we consider are standard for all kernels (as described in e.g. Schölkopf and Smola (2021)), and it will be straightforward to see that they preserve discrete masses. We also introduce a new transformation, reparameterizing alphabets, that is particularly relevant for the analysis of biological sequence kernels.

5.2.1 Summing

First we consider linear combinations of kernels with discrete masses. The following result says that as long as one kernel in this sum has discrete masses, then so does the entire summed kernel.

Proposition 12 Say $\Lambda$ is a measurable space and $(k_\lambda)_{\lambda \in \Lambda}$ is a family of kernels. Assume for any $X, Y \in S$, $\lambda \mapsto k_\lambda(X, Y)$ is measurable. Say $\nu$ is a positive, nonzero measure on $\Lambda$ with $\int d\nu(\lambda)\, k_\lambda(X, X) < \infty$ for all $X \in S$ and $\nu(\{\lambda \mid k_\lambda \text{ has discrete masses}\}) > 0$, i.e. $\nu$ has positive mass on kernels with discrete masses.³ Then, $k_\nu = \int k_\lambda\, d\nu(\lambda)$ is a kernel on $S$ that has discrete masses.

Proof We consider the possibility that $k_\nu$ does not have discrete masses and show that this leads to a contradiction. By Proposition 9, there is some $X \in S$ and a sequence of finitely supported signed measures $\mu_l = \delta_X - \sum_{Y \neq X} \alpha_{l,Y} \delta_Y$ such that $\|\int k_{\nu,Z}\, d\mu_l(Z)\|_{k_\nu} \to 0$ as $l \to \infty$. Then,

$$\left\| \int k_{\nu,Z}\, d\mu_l(Z) \right\|_{k_\nu}^2 = \int d\mu_l(Z_1) \int d\mu_l(Z_2)\, k_\nu(Z_1, Z_2) = \int d\nu(\lambda) \int d\mu_l(Z_1) \int d\mu_l(Z_2)\, k_\lambda(Z_1, Z_2) = \int d\nu(\lambda) \left\| \int k_{\lambda,Z}\, d\mu_l(Z) \right\|_{k_\lambda}^2.$$

Thus, by Fatou's lemma,

$$\int d\nu(\lambda)\, \liminf_{l \to \infty} \left\| \int k_{\lambda,Z}\, d\mu_l(Z) \right\|_{k_\lambda}^2 = 0.$$

In particular, there is a $\lambda$ such that $k_\lambda$ has discrete masses and $\liminf_l \|\int k_{\lambda,Z}\, d\mu_l(Z)\|_{k_\lambda}^2 = 0$, which is a contradiction.

Corollary 13 Let $k_0, k_1, \ldots, k_N$ be kernels such that $k_0$ has discrete masses, and let $\alpha_0, \ldots, \alpha_N > 0$. Then $\sum_{n=0}^N \alpha_n k_n$ is a kernel with discrete masses.

3. If $\{\lambda \mid k_\lambda \text{ has discrete masses}\}$ is not measurable, then we instead require that $\nu(C) > 0$ for all measurable $C \supseteq \{\lambda \mid k_\lambda \text{ has discrete masses}\}$.

5.2.2 Changing domains

Next we consider using different kernels over different regions of sequence space. The following result says that if a kernel has discrete masses over all of $S$, it also has discrete masses when restricted to just one region of $S$. As well, if we have separate kernels with discrete masses over separate orthogonal regions of $S$, we can then combine them to construct a new kernel with discrete masses over all of $S$.

Proposition 14 Say $\{W_V\}_V$ is a collection of disjoint subsets of $S$ such that $S = \bigcup_V W_V$. If $k$ has discrete masses, then it also has discrete masses when restricted to any $W_V$.
On the other hand, if $k$ has discrete masses when restricted to each $W_V$, and $k(X, X') = 0$ for any $X \in W_V$, $X' \in W_{V'}$ with $V' \neq V$, then $k$ has discrete masses over $S$.

Proof First assume $k$ has discrete masses over $S$. A standard property of kernels is that the Hilbert space of a kernel restricted to a domain $W_V$, that is $H_{k|W_V}$, is the closure of $\mathrm{span}\{k_Y\}_{Y \in W_V}$ in the original Hilbert space $H_k$. Now consider any $X \in W_V$. By Proposition 9, $k_X$ is not in the closure of $\mathrm{span}\{k_Y\}_{Y \in S,\, Y \neq X}$, so it is also not in the closure of $\mathrm{span}\{k_Y\}_{Y \in W_V,\, Y \neq X}$. Thus, applying Proposition 9 again, $k$ has discrete masses over $W_V$.

Now consider the case where $k$ has discrete masses when restricted to each $W_V$. Assume $\delta_X \notin H_k$ for some $X \in W_V$; we will show this leads to a contradiction. By Proposition 9, there exists a sequence of functions $f_1 = \sum_{Y \neq X} \alpha_{1,Y} k_Y$, $f_2 = \sum_{Y \neq X} \alpha_{2,Y} k_Y, \ldots \in \mathrm{span}\{k_Y\}_{Y \neq X}$ such that $f_m \to k_X$ as $m \to \infty$. Let $f_{V,m}$ be the orthogonal projection of $f_m$ onto $\mathrm{span}\{k_Y\}_{Y \in W_V}$, that is, $f_{V,m} = \sum_{Y \neq X} 1(Y \in W_V)\, \alpha_{m,Y} k_Y$. Note that for every $m$ we have $f_{V,m} \in \mathrm{span}\{k_Y\}_{Y \in W_V,\, Y \neq X}$, so $f_{V,m}$ is in the Hilbert space $H_{k|W_V}$ of the kernel restricted to $W_V$. Define $f_{V,m}^\perp = f_m - f_{V,m} = \sum_{Y \neq X} 1(Y \notin W_V)\, \alpha_{m,Y} k_Y$. Now $(f_{V,m}|f_{V,m}^\perp)_k = 0$ and $(f_{V,m}^\perp|k_X)_k = 0$, so

$$\|f_{V,m} - k_X\|_k^2 \leq \|f_{V,m}^\perp\|_k^2 + \|f_{V,m} - k_X\|_k^2 = \|f_m - k_X\|_k^2 \to 0.$$

In other words, $k_X$ is in the closure of $\mathrm{span}\{k_Y\}_{Y \in W_V,\, Y \neq X}$. This implies that the kernel $k$ restricted to $W_V$ does not have discrete masses, a contradiction.

5.2.3 Tensorizing

We next consider tensorizing kernels, so that they can be applied to pairs of sequences (Fukumizu et al., 2007). If we have two kernels $k_1$ and $k_2$ on $S$, the tensorized kernel $k_1 \otimes k_2$ is $k_1 \otimes k_2((X_1, X_2), (Y_1, Y_2)) = k_1(X_1, Y_1)\, k_2(X_2, Y_2)$ for $X_1, X_2, Y_1, Y_2 \in S$. Tensorization preserves discrete masses.

Corollary 15 Let $k_1$ and $k_2$ be kernels on $S$ with discrete masses. Then $k_1 \otimes k_2$ is a kernel on $S^2$ with discrete masses.

Proof A standard property of kernels, which follows from the results in Schölkopf and Smola (2021), says that if $f_1 \in H_{k_1}$ and $f_2 \in H_{k_2}$, there is an $f_1 \otimes f_2 \in H_{k_1 \otimes k_2}$ such that for $(X_1, X_2) \in S^2$ we have $f_1 \otimes f_2(X_1, X_2) = f_1(X_1) f_2(X_2)$. Say $X, Y \in S$. Then $\delta_{(X,Y)} = \delta_X \otimes \delta_Y \in H_{k_1 \otimes k_2}$.

5.2.4 Tilting

Next we consider re-weighting kernels to emphasize or de-emphasize certain areas of sequence space. More precisely, we consider tilting a kernel $k$ by some function $A : S \to (0, \infty)$ to obtain a new kernel $k_A(X, Y) = A(X)\, k(X, Y)\, A(Y)$ for $X, Y \in S$. The discrete mass property is preserved after tilting.

Proposition 16 If $k$ is a kernel on $S$ that has discrete masses, then $k_A$ has discrete masses.

Proof A standard property of kernels, which follows from the results in Schölkopf and Smola (2021), is that if we have a function $f(\cdot)$ in the original RKHS, $H_k$, then the function $A(\cdot) f(\cdot)$ is in the RKHS of the tilted kernel, $H_{k_A}$, and has the same norm, $\|f\|_k = \|A f\|_{k_A}$. In other words, $f \mapsto A f$ is an isometric isomorphism of $H_k$ to $H_{k_A}$. So for any $X \in S$, we have $A \delta_X \in H_{k_A}$. We can always multiply a function in an RKHS by a finite scalar. Thus, $\delta_X = \frac{1}{A(X)} A(\cdot) \delta_X(\cdot) \in H_{k_A}$.

5.2.5 Reparameterizing alphabets

Finally, we look at a novel transformation of kernels, specific to biological sequences, that involves reparameterizing the alphabet $B$. The basic idea starts from representing letters in the alphabet $B$ as one-hot encoded vectors, i.e. we treat each $b \in B$ as a vector of length $|B|$ with zeros everywhere except at one position. The alphabet $B$ thus forms a basis of $R^B$. However, we may also consider an alternative basis $\tilde{B}$ of $R^B$.
This alternative basis gives rise to an alternative set of sequences $\tilde{S} = \bigcup_{L=0}^{\infty} \tilde{B}^L$. By treating $S$ and $\tilde{S}$ as sets of vectors, there will be a natural way to extend a kernel on $S$ to $\tilde{S}$. We will see that the property of discrete masses is invariant to this change in basis: if the kernel has discrete masses over $S$, it also has discrete masses over $\tilde{S}$, and vice versa. (Note that in this section and all following sections, $S$ will be the set of sequences, rather than an arbitrary infinite discrete space.)

To be more precise, we must define what it means to apply a sequence kernel to a vector encoding of a sequence. Any sequence $Y$ in $S = \bigcup_{L=0}^{\infty} B^L$ can be represented as a one-hot encoding, which consists of a vector $V \in R^{|Y| \times B}$ such that $V_{(l)} \in R^B$ is the one-hot encoding of the letter $Y_{(l)}$. If the vector $V \in R^{L \times B}$ is a one-hot encoding of a sequence, then we define its embedding into $H_k$ as the embedding of the sequence it encodes. We can write this using the formula

$$k_V = \sum_{X \in B^L} \left( \prod_{l=0}^{L-1} V_{(l), X_{(l)}} \right) k_X. \qquad (3)$$

Thus if $V$ is a one-hot encoding of $Y$, we recover $k_V = k_Y$. The kernel has not changed; all we have done is rewrite it to embed vector-encoded sequences. We now apply the same kernel to different sequences that use different encodings, which are not one-hot. In particular, let $\tilde{B}$ be an alternative basis of $R^B$. Each sequence $\tilde{Y}$ in $\tilde{S} = \bigcup_{L=0}^{\infty} \tilde{B}^L$ can be represented by a vector encoding $\tilde{V} \in R^{|\tilde{Y}| \times B}$, where $\tilde{V}_{(l)}$ is the encoding of the letter $\tilde{Y}_{(l)}$. We define the kernel embedding of $\tilde{Y}$ by plugging $\tilde{V}$ into Equation 3; it consists of a linear combination of kernel embeddings $k_X$ from each sequence $X \in S$ with $|X| = |\tilde{V}|$. We will show that if the kernel $k$ has discrete masses over $S$, and we apply $k$ to $\tilde{S}$, it will still have discrete masses. We call this shift from the alphabet $B$ to $\tilde{B}$ a reparameterization of the kernel's alphabet.

Intuitively, in the case of proteins, we can think of the reparameterized alphabet as a set of amino acid properties, such as mass, charge, etc. Each amino acid is a letter in the original alphabet $B$, while each property is a letter in the reparameterized alphabet $\tilde{B}$; since both alphabets form bases of $R^B$, each amino acid can be described as a linear combination of properties. We can analyze the flexibility of a kernel over sequences of amino acids by analyzing the flexibility of the same kernel applied to sequences of amino acid properties. This is useful theoretically for studying complex kernels.

Proposition 17 Say $k$ is a strictly positive definite kernel on $S$, and $\tilde{B}$ is a basis of $R^B$. Then $k$ has discrete masses as a kernel on $S$ if and only if it has discrete masses as a kernel on $\tilde{S}$.

Proof Both $B$ and $\tilde{B}$ are bases of $R^B$, so the kernel over $\tilde{S}$ is a reparameterization of the kernel over $S$ and vice versa. Thus, we only need to show that $k$ has discrete masses as a kernel on $\tilde{S}$ if it has discrete masses as a kernel on $S$. First, note that reparameterizing the alphabet does not change the span of the kernel embedding vectors, $\mathrm{span}\{k_{\tilde{V}}\}_{\tilde{V} \in \tilde{B}^L} = \mathrm{span}\{k_V\}_{V \in B^L}$, since both $B$ and $\tilde{B}$ are bases of $R^B$. Now, consider some length $L$. If $\tilde{X} \in \tilde{S}$ with $|\tilde{X}| = L$, then $k_{\tilde{X}} = \sum_{Y \in S,\, |Y| = L} \alpha_{\tilde{X}, Y} k_Y$ for some $\alpha_{\tilde{X}, Y}$. As $k$ is strictly positive definite, $\{k_X\}_{|X| = L}$ is a linearly independent set, so $\alpha$ is an invertible square matrix with dimension $|B|^L \times |B|^L$. Let $f_{\tilde{X}} = \sum_{Y \in S,\, |Y| = L} (\alpha^{-1})_{Y, \tilde{X}}\, \delta_Y$. Then $f_{\tilde{X}}(\tilde{Z}) = 0$ for $\tilde{Z} \in \tilde{S}$ with $|\tilde{Z}| \neq L$, and if $|\tilde{Z}| = L$, $f_{\tilde{X}}(\tilde{Z}) = \sum_{Y \in S,\, |Y| = L} \alpha_{\tilde{Z}, Y} (\alpha^{-1})_{Y, \tilde{X}} = \delta_{\tilde{X}}(\tilde{Z})$. Thus $\delta_{\tilde{X}} \in H_k$.
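Before turning to concrete biological sequence kernels, the following small numerical sketch (an illustration with hypothetical helper names, not one of the paper's experiments) shows the Gram-matrix criterion of Theorem 11 in action. It tracks $C_{B,X} = \sqrt{(K_B^{-1})_{X,X}}$ over growing finite subsets for two kernels on a one-letter sequence space: the kernel $\sigma^{m \vee m'}$ on the naturals, which will reappear in the proof of Theorem 19 and has discrete masses, and the embedding kernel of Example 2, which does not.

```python
import numpy as np

def gram(points, kernel):
    return np.array([[kernel(x, y) for y in points] for x in points])

def c_bx(points, kernel, x_index=0):
    """C_{B,X} = sqrt((K_B^{-1})_{X,X}) for the finite subset B = points, with X = points[x_index]."""
    K = gram(points, kernel)
    return np.sqrt(np.linalg.inv(K)[x_index, x_index])

# A kernel with discrete masses on the naturals (cf. the proof of Theorem 19): k(m, m') = sigma^(m v m').
sigma = 0.5
k_mass = lambda m, n: sigma ** max(m, n)

# The embedding kernel of Example 2 on {A, 2A, 3A, ...}, representing n*A by the integer n:
# F(1) = 0, F(n) = 1/n for n > 1, and k(X, Y) = exp(-(F(X) - F(Y))^2).
F = lambda n: 0.0 if n == 1 else 1.0 / n
k_embed = lambda m, n: np.exp(-(F(m) - F(n)) ** 2)

for size in [3, 5, 8, 12]:
    B = list(range(1, size + 1))  # B = {A, 2A, ..., size*A}, with X = A
    print(size, c_bx(B, k_mass), c_bx(B, k_embed))
# C_{B,X} stays bounded for k_mass but grows with the subset for k_embed, which lacks discrete masses.
# (The embedding kernel's Gram matrix becomes badly conditioned quickly; keep the subsets modest.)
```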
Throughout the rest of the paper, for any kernel on $S$ and any $V, W \in \bigcup_{L=0}^{\infty} R^{L \times B}$, we will write $k(V, W) = (k_V \mid k_W)_k$, where $k_V$ and $k_W$ are defined by Equation 3.

6. Position-wise comparison kernels

In this section, we design kernels with discrete masses that compare sequences position-by-position. We saw in Figure 1 and Section 4.2 that the Hamming kernel, the weighted degree kernel, and other related kernels that compare sequences position-by-position are neither universal nor characteristic. Here, we develop alternative kernels that capture the same biological ideas but are also highly flexible.

Position-wise sequence comparison is ubiquitous in biology, and has strong biological justification for many problems. For instance, a common observation is that nucleotides or amino acids at a specific position have a specific biological function; for instance, the amino acids at a few particular positions may chemically react to form a fluorophore, making the protein fluorescent. So, when predicting phenotype from sequence, a reasonable measure of sequence similarity is one that compares sequences position-by-position, as is done in the Hamming kernel. Moreover, a very common form of mutation during evolution is a substitution, which switches one letter for another. Thus, a position-wise measure of sequence similarity can capture evolutionary distance as well as phenotypic distance, which may be desirable, for instance, when comparing sequence distributions.

Our new kernels preserve existing notions of position-wise sequence similarity. For example, as two sequences differ more and more according to the Hamming kernel, they will also differ more and more according to our proposed replacement for the Hamming kernel. What we modify is not the measure of sequence similarity but instead the functional form of the dependence of the kernel on sequence similarity, i.e. how exactly a change in the similarity of $X$ and $Y$ translates into a change in $k(X, Y)$. For existing kernels, the functional form rarely has strong biological justification, and instead is often motivated solely by convenience. Our results demonstrate that the functional form in fact matters a great deal for the reliability of kernel methods for biological sequences.

We start by studying a simple kernel that compares sequences position-by-position. We then use this base kernel to derive many other varieties of position-wise comparison kernels, using the transformations developed in Section 5.2. The base position-wise comparison kernel is defined as the product of individual kernels applied to the letters at each position.

Definition 18 (Base position-wise comparison kernel) We represent each sequence $X \in S$ as terminated by an infinite tail of stop symbols $\$$. Let $k_s$ be a strictly positive definite kernel on letters $B \cup \{\$\}$ with $k_s(\$, \$) = 1$. Now, the base position-wise comparison kernel is

$$k(X, Y) = \prod_{l=0}^{\infty} k_s(X_{(l)}, Y_{(l)}). \qquad (4)$$

Note that because $k_s(\$, \$) = 1$, the infinite product is always finite. This kernel compares the sequences $X$ and $Y$ at each position $l$, according to $k_s$. For example, one natural choice of $k_s$ is to set $k_s(b, b') = 1$ if $b = b'$ and $k_s(b, b') = e^{-\lambda}$ if $b \neq b'$, for $\lambda > 0$. This gives the "exponential Hamming kernel",

$$k(X, Y) = \exp\Big(-\lambda \sum_l 1(X_{(l)} \neq Y_{(l)})\Big) = \exp(-\lambda\, d_H(X, Y)),$$

where $d_H$ is the Hamming distance. Recall that the Hamming kernel of lag $L = 1$ takes the form $k(X, Y) = |X| \vee |Y| - d_H(X, Y)$. Thus, the two kernels both measure sequence similarity using the Hamming distance, but differ in the functional form of their dependence.
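For concreteness, here is a minimal sketch of the base position-wise comparison kernel of Definition 18, instantiated with the exponential-Hamming choice of $k_s$; helper names and the default bandwidth are illustrative. It checks that Equation 4 reduces to $\exp(-\lambda\, d_H(X, Y))$ for this choice.

```python
# Minimal sketch of the base position-wise comparison kernel (Definition 18), using
# k_s(b, b') = exp(-lam * 1(b != b')); names and defaults are illustrative.
import itertools
import numpy as np

STOP = "$"

def k_s(b1, b2, lam=1.0):
    # Letter kernel: 1 on a match (including $ vs $), exp(-lam) on a mismatch.
    return 1.0 if b1 == b2 else np.exp(-lam)

def base_positionwise_kernel(x, y, lam=1.0):
    # Pad the shorter sequence with stop symbols; positions beyond both lengths contribute
    # factors of k_s($, $) = 1, so the infinite product in Eq. (4) reduces to a finite one.
    value = 1.0
    for xl, yl in itertools.zip_longest(x, y, fillvalue=STOP):
        value *= k_s(xl, yl, lam)
    return value

def hamming_distance(x, y):
    return sum(a != b for a, b in itertools.zip_longest(x, y, fillvalue=STOP))

x, y = "ATGC", "TGC"
print(base_positionwise_kernel(x, y))         # equals exp(-lam * d_H(x, y))
print(np.exp(-1.0 * hamming_distance(x, y)))  # same value
```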
The base position-wise comparison kernel, unlike the weighted degree kernel, has discrete masses.

Theorem 19 The base position-wise comparison kernel has discrete masses.

Proof Our proof strategy will be to reparameterize the alphabet $B$ (using Proposition 17) such that the RKHS $H_k$ decomposes into a product of orthogonal hyperplanes, each spanned by a subset of sequences. We prove that the kernel restricted to each of these hyperplanes takes a simple form, and has discrete masses. We then apply Proposition 14 to merge the separate hyperplanes and prove the result.

We start by reparameterizing the base position-wise comparison kernel, $k$. First note that $k$ is a strictly positive definite kernel by the Schur product theorem. Define $K$ as the matrix with $K_{b',b} = k_s(b, b')$ for $b, b' \in B$. Call $t \in R^B$ the vector with $t_b = k_s(b, \$)$ for $b \in B$. Let $U = K^{-1} t$ and call $\sigma = U^T K U$. If $V, W \in R^B$, define $k_{s,V} = \sum_{b \in B} V_b\, k_{s,b}$, $k_{s,W}$ analogously, $k_s(V, W) = (k_{s,V} \mid k_{s,W})_{k_s}$ and $k_s(V, \$) = (k_{s,V} \mid k_{s,\$})_{k_s}$. Then if $V \in R^B$,

$$k_s(V, U) = V^T K U = V^T t = k_s(V, \$).$$

Note in particular that since $k_s$ is strictly positive definite, we have the strict Cauchy-Schwarz inequality $\sigma = k_s(U, U) = k_s(U, \$) < \sqrt{k_s(U, U)\, k_s(\$, \$)} = \sqrt{\sigma}$, which can only be the case if $\sigma < 1$.

Let $\tilde{B}_1, \ldots, \tilde{B}_{|B|-1} \in R^B$ be chosen so that $\{U, \tilde{B}_1, \ldots, \tilde{B}_{|B|-1}\}$ is an orthogonal basis of the vector space $R^B$ when using the inner product $(v|w) = v^T K w$, with $(\tilde{B}_i|\tilde{B}_i) = 1$ for all $i$. Thus, $k_s(\tilde{B}_i, \tilde{B}_j) = \delta_i(j)$ for all $i, j$ and $k_s(U, \tilde{B}_i) = 0$ for all $i$. Call $\tilde{B} = \{U, \tilde{B}_1, \ldots, \tilde{B}_{|B|-1}\}$ and $\tilde{S}$ the set of sequences made up of letters in $\tilde{B}$. By Proposition 17, if we can show that the kernel $k$ has discrete masses on $\tilde{S}$, we know that it has discrete masses on $S$.

We now break up $\tilde{S}$ into separate domains $W_V$. Recall the base position-wise comparison kernel takes the form $k(V, V') = \prod_{l=0}^{\infty} k_s(V_{(l)}, V'_{(l)})$. When applied to sequences in the reparameterized alphabet, i.e. $V, V' \in \tilde{S}$, we have for all $b, b' \in \tilde{B} \cup \{\$\}$ that $k_s(b, b') = 0$ if $b \neq b'$, except that $k_s(U, \$) = \sigma$. Thus, for $X, Y \in \tilde{S}$ with $|Y| \leq |X| = L$, $k(X, Y) \neq 0$ if and only if the first $|Y|$ letters of $X$ and $Y$ are identical and all the letters of $X$ after position $|Y|$ are $U$. If $V \in \tilde{S}$ does not end in $U$, define $W_V = \{V + m \cdot U\}_{m=0}^{\infty}$ to be all the sequences in $\tilde{S}$ that start with $V$ and continue with a tail of $U$s. Then if $V' \in \tilde{S}$ is a different sequence that does not end in $U$, $W_{V'}$ is orthogonal to $W_V$ in $H_k$. Thus $H_k$ is made up of orthogonal hyperplanes spanned by the sets $\{W_V\}_V$. By Proposition 14, $k$ has discrete masses if and only if $k$ has discrete masses when restricted to each $W_V$.

We now show that the kernel applied to each $W_V$ has discrete masses. Note, since $k_s(U, \$) = k_s(U, U) = \sigma$, we have $k(V + m \cdot U, V + m' \cdot U) = k(V, V)\, k(m \cdot U, m' \cdot U) = k(V, V)\, \sigma^{m \vee m'}$. Thus $k$ restricted to $W_V$ is equivalent to the kernel $k^{(n)}(m, m') = \sigma^{m \vee m'}$ applied to $m, m' \in N$, times a constant, $k(V, V)$. We therefore just need to show that $k^{(n)}(m, m')$, a one-dimensional kernel defined over the natural numbers, has discrete masses.

We prove this by induction. First, noting $\sigma < 1$, let $f = (\sigma - 1)^{-1}(k^{(n)}_1 - k^{(n)}_0)$. We see that $\delta_0 = f$, so $\delta_0 \in H_{k^{(n)}}$. Now assume $\delta_0, \ldots, \delta_{x-1} \in H_{k^{(n)}}$ for some $x \in N$. Let $f_x = \sigma^{-x}(\sigma - 1)^{-1}(k^{(n)}_{x+1} - k^{(n)}_x)$. We see that $f_x(z) = 0$ if $z > x$ and $f_x(z) = 1$ if $z \leq x$. Thus $\delta_x = f_x - \sum_{z=0}^{x-1} f_x(z)\, \delta_z \in H_{k^{(n)}}$. Invoking Proposition 14, the proof is complete.

We can leverage the base position-wise comparison kernel to construct more complex position-wise comparison kernels that address specific modeling challenges, and also have discrete masses.
In each of these examples, we use for simplicity $k_s(b, b') = \exp(-\lambda\, 1(b \neq b'))$, but one can just as well use any strictly positive definite kernel $k_s$, for example one which compares $b$ and $b'$ based on amino acid similarity.

Example 5 (Heavy-tailed Hamming kernels) Practical challenge: The exponential Hamming kernel is thin-tailed in the sense that it decreases quickly (exponentially) as the Hamming distance increases. Consider a data set $\{X_1, X_2, \ldots\}$ consisting of distantly related sequences, and the Gram matrix $K$ with entries $K_{ij} = k(X_i, X_j)$. Thin tails can lead to "diagonal dominance", where the diagonal entries $K_{ii}$ are much larger than the off-diagonal entries $K_{ij}$ for $i \neq j$. The kernel may therefore behave much like the identity kernel (Example 3), and exhibit similarly poor generalization when used in a Gaussian process or other regression method. Fixing diagonal dominance by tuning $\lambda$ is difficult; typically the only other regime available is one in which the matrix $K$ is nearly constant, i.e. $K_{i,j} \approx$ constant for all $i, j$. In this case, the regression will make approximately the same prediction everywhere, and still exhibit poor generalization.

Proposed kernel: To address diagonal dominance, we propose a heavy-tailed kernel, whose value decays slowly as sequence distance increases. Take $\beta, C > 0$. Then the inverse multiquadric Hamming kernel is defined by integrating over the bandwidth parameter $\lambda$,

$$k(X, Y) = (C + d_H(X, Y))^{-\beta} = \Gamma(\beta)^{-1} \int_0^{\infty} d\lambda\, \lambda^{\beta - 1} e^{-C\lambda}\, e^{-\lambda\, d_H(X, Y)}, \qquad (5)$$

where $\Gamma$ is the gamma function. As Hamming distance increases, this kernel decreases according to a power-law instead of exponentially, i.e. its tails are very heavy. By Proposition 12, this kernel has discrete masses.

Example 6 (k-mer Hamming kernels) Practical challenge: The exponential Hamming kernel considers two sequences similar if they have the same letter in the same position. However, the biological function of a sequence can depend not just on whether it has a certain letter in a certain position, but, further, on whether it has a certain string of letters (such as a motif) at a certain position. It may therefore be more appropriate to judge the similarity of two sequences based on whether they have the same kmers at the same position, rather than single letters. This is the idea behind the Hamming kernel of lag $L > 1$. The problem is that this kernel is neither characteristic nor universal (Appendix A).

Proposed kernel: We propose a kernel that uses the same measure of sequence similarity as the Hamming kernel of lag $L$, but takes a different functional form, which ensures it has discrete masses. Let $L$ be the context size, and define $\bar{B} = B^L$ to be the set of sequences of length $L$. We will treat $\bar{B}$ as an alphabet, and consider the set of sequences made from this alphabet, $\bar{S}$. Now consider the kernel $\bar{k}(\bar{X}, \bar{Y}) = (C + d_H(\bar{X}, \bar{Y}))^{-\beta}$ on $\bar{S}$ for some $C, \beta > 0$; recall that this kernel has discrete masses. Each sequence in $S$ (the original sequence space) can be uniquely embedded in $\bar{S}$ by the mapping $h : S \to \bar{S}$, defined as $h(X) = (X_{(:L)}, X_{(1:L+1)}, X_{(2:L+2)}, \ldots)$, that is, $h$ maps each sequence to its sequence of L-mers in order. Note $h$ is injective, but it is not surjective. By Proposition 14, $\bar{k}$ restricted to the image of $h$ has discrete masses. Thus the "inverse multiquadric Hamming kernel of lag L",

$$k(X, Y) = \bar{k}(h(X), h(Y)) = \Big( C + |X| \vee |Y| - \sum_{l=0}^{|X| \vee |Y| - 1} 1(X_{(l:l+L)} = Y_{(l:l+L)}) \Big)^{-\beta},$$

which is defined for $X, Y \in S$, has discrete masses. Note that the sum in the above expression is precisely the Hamming kernel of lag $L$. In other words, the notion of sequence similarity has not changed, but the functional form of the kernel has, and this gives the new kernel discrete masses.
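The following is a minimal sketch of the inverse multiquadric Hamming kernel of Example 5 and its lag-L variant from Example 6; helper names and parameter defaults are illustrative assumptions.

```python
import itertools

STOP = "$"

def hamming_distance(x, y):
    # Mismatches after padding the shorter sequence with stop symbols.
    return sum(a != b for a, b in itertools.zip_longest(x, y, fillvalue=STOP))

def imq_hamming(x, y, C=1.0, beta=0.5):
    # Example 5: k(X, Y) = (C + d_H(X, Y))^(-beta); power-law rather than exponential tails.
    return (C + hamming_distance(x, y)) ** (-beta)

def imq_hamming_lag(x, y, L=3, C=1.0, beta=0.5):
    # Example 6: compare padded L-mers at each position instead of single letters.
    n = max(len(x), len(y))
    pad = lambda s: s + STOP * (n + L - len(s))
    xp, yp = pad(x), pad(y)
    mismatches = sum(xp[l:l + L] != yp[l:l + L] for l in range(n))
    return (C + mismatches) ** (-beta)

x, y = "ATGCAT", "ATGGAT"
print(imq_hamming(x, y), imq_hamming_lag(x, y))
```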
Example 7 (Centre-justified Hamming kernels) Practical challenge: The exponential Hamming kernel compares sequences position-by-position starting from position one. This is sensible when position one is a good reference point, such as the start of a gene. In some cases, however, we are interested in a region of sequence to either side of a reference point, for instance the DNA sequence upstream as well as downstream of the start codon (i.e. both promoter and coding regions). For example, rather than comparing sequences as

    |AACTTCT$$$$...
    |GGACTTCTCT$...

where | marks the reference point, we want to compare them as

    ...$$AACT|TCT$$$...
    ...$GGACT|TCTCT$...

Note that different sequences may now have variable lengths to either side of the reference point |.

Proposed kernel: We can interpret each data point as consisting not of one sequence but rather a pair of two sequences, $(X^{(1)}, X^{(2)})$, where $X^{(1)}$ is the sequence to the left of the reference point and $X^{(2)}$ the sequence to the right. Thus, each data point is in $S^2$ rather than $S$. Starting with the exponential Hamming kernel (or any other kernel with discrete masses), we can extend it to $S^2$ using tensorization, and Corollary 15 guarantees it has discrete masses.

Example 8 (Hamming kernels with shifts) Practical challenge: It is not always clear which positions exactly should be compared between two sequences. For instance, while protein sequences often have a clear starting position (the start codon), for other genetic elements (such as enhancers) the beginning of the sequence is less well-defined. One approach to this problem is to compare sequences under various offsets relative to one another, as in the Hamming kernel with shifts (Rätsch et al., 2005). The Hamming kernel with shifts, however, does not have discrete masses.

Proposed kernel: We can define a shifted position-wise comparison kernel as

$$k_L(X, Y) = \sum_{l=0}^{L} \big( k(X_{(l:)}, Y) + k(X, Y_{(l:)}) \big),$$

where $k$ is, for instance, the exponential Hamming kernel. This kernel compares $X$ to $Y$ under various offsets, i.e. starting from position $l = 0$, then position $l = 1$, etc., up to some maximum $L$; it will be large if at least one of these offsets produces a good match. The kernel is a sum over individual kernels with discrete masses, so by Proposition 12 it has discrete masses.

Note each of these proposed kernels, in addition to having discrete masses, is computationally tractable, and in particular can be efficiently computed using the techniques developed for weighted degree kernels (Sonnenburg et al., 2007).

7. Alignment kernels

In this section, we design kernels with discrete masses that compare sequences based on pairwise alignments. Whereas position-wise comparison kernels judge two sequences $X$ and $Y$ to be similar only if there are a small number of substitutions that can transform $X$ into $Y$, alignment kernels judge two sequences to be similar if there are a small number of substitutions, insertions and/or deletions that can transform $X$ into $Y$. Biologically, sequences that are similar in this way often share similar phenotypes, and are closely related evolutionarily. However, this notion of similarity is poorly captured by position-wise comparison kernels.
For example, the sequence $X = ATGC$ differs from the sequence $Y = TGC$ only by the insertion of a single A at the start of the sequence, but kernels based on the Hamming distance will judge the two sequences to be completely different, with distance $d_H(X, Y) = 4 = |X|$.

One popular, but problematic, technique for dealing with insertions and deletions is to pre-process the data set using a multiple sequence alignment algorithm, which adds gap symbols to each sequence to make it the same length. In essence, the idea behind multiple sequence alignment algorithms is to construct a point estimate of which letters in each sequence are evolutionarily related to one another via substitutions, and place each related letter in the same position, i.e. in the same column of the pre-processed data matrix. Once sequences have been aligned, it is more reasonable to apply a position-wise comparison kernel; insertions and deletions are compared by including the gap symbol in the alphabet $B$. The problem with multiple sequence alignment pre-processing is that (a) it does not take into account uncertainty in which positions are related to one another, instead relying on a point estimate, and (b) it prevents downstream machine learning methods from generalizing to unseen sequences, since adding new sequences to the data set can change the multiple sequence alignment, for instance, if the new sequence is longer than those previously observed (Weinstein and Marks, 2021).

Alignment kernels offer an alternative approach to accounting for insertions and deletions (Haussler, 1999). Rather than transform the entire data set, they consider two sequences $X$ and $Y$ to be similar if they differ by a small number of insertions and deletions, as well as substitutions. Alignment kernels are of special relevance to problems involving sequences with high length variation, such as in the analysis of human antibodies or of distantly related evolutionary homologs. Alignment kernels are typically computed with a dynamic programming procedure; comparing two sequences of length $L$ costs $O(L^2)$ (Watkins, 2000). However, the algorithm can be made much more efficient in practice with modest assumptions and the use of GPUs (Rush, 2020).

7.1 The alignment kernel

In this section we define the alignment kernel and show that it has discrete masses if and only if its hyperparameters are set to values in a certain range.

In biological sequence analysis, a pairwise alignment between two sequences is a matching between a subset of positions in each sequence, with the restrictions that (a) each position in each sequence can be matched to at most one position in the other sequence, and (b) matchings must be ordered, such that if site $l_X$ in sequence $X$ is matched to $l_Y$ in sequence $Y$, a site $l'_X > l_X$ in $X$ cannot match to a site $l'_Y < l_Y$ in $Y$ (Durbin et al., 1998, chap. 2). For example, one alignment between the sequences $X = ATGC$ and $Y = ACC$ is,

    A T G C
    | |   |
    A C - C

where vertical lines | denote a matching. Here, - is a gap symbol, denoting the fact that the nucleotide G in $X$ is unmatched; we can interpret the G as an insertion in $X$ relative to $Y$, or, conversely, interpret the gap - as a deletion in $Y$ relative to $X$. The alignment kernel $k(X, Y)$ considers all possible pairwise alignments between $X$ and $Y$, scores each alignment according to a position-wise comparison kernel applied to the aligned sequences, and sums up those scores to produce its value.
Mathematically, to define the alignment kernel, we start with two simpler kernels and then convolve them, following the construction of Haussler (1999). The first kernel, $k_s$, evaluates matches between aligned letters. The second kernel, $k_I$, evaluates insertions and deletions; it applies an affine gap penalty, which separately penalizes the presence of a gap and the gap's total length. For any two kernels $k_1$ and $k_2$ on $S$, the convolution kernel $k_1 \star k_2$ is,

$$(k_1 \star k_2)(X, Y) = \sum_{\substack{X^{(1)} + X^{(2)} = X \\ Y^{(1)} + Y^{(2)} = Y}} k_1(X^{(1)}, Y^{(1)})\, k_2(X^{(2)}, Y^{(2)}),$$

where the sum is over all sequences $X^{(1)}, X^{(2)}, Y^{(1)}, Y^{(2)} \in S$ such that $X^{(1)} + X^{(2)} = X$ and $Y^{(1)} + Y^{(2)} = Y$. Recall that $S$ includes the sequence of length zero, so the sum includes such terms as $X^{(1)} = X$, $X^{(2)} = \emptyset$.

Definition 20 (Alignment kernel) Let $k'_s$ be a strictly positive definite kernel on $B$. Define $k_s$ by extending $k'_s$ to $S$ with $k_s(X, Y) = 0$ if $|X| \neq 1$ or $|Y| \neq 1$. Let $\tilde{\mu} \geq 0$ be the penalty for starting an insertion and let $\mu > 0$ be the penalty for the insertion's length. The insertion penalty function $g$ is then defined as $g(X) = 1$ if $|X| = 0$ and $g(X) = \exp(-\tilde{\mu} - |X|\mu)$ otherwise. Define the non-negative-definite kernel $k_I(X, Y) = g(X) g(Y)$ for $X, Y \in S$. The alignment kernel $k$ is,

$$k = \sum_{l=0}^{\infty} k_I \star (k_s \star k_I)^{\star l},$$

where the exponent $\star l$ denotes the convolution of $l$ copies of $k_s \star k_I$.

To see that this kernel indeed considers all possible pairwise alignments of $X, Y \in S$, note that we can rewrite $k(X, Y)$ as,

$$k(X, Y) = \sum_{l=0}^{\infty} \sum_{\substack{X^{(1)} + \cdots + X^{(2l+1)} = X \\ Y^{(1)} + \cdots + Y^{(2l+1)} = Y}} k_I(X^{(1)}, Y^{(1)}) \prod_{i=1}^{l} k_s(X^{(2i)}, Y^{(2i)})\, k_I(X^{(2i+1)}, Y^{(2i+1)}). \qquad (6)$$

Each term in the sum represents an alignment. Each alignment has $l$ matches, in which the even subsequences $X^{(2)}, X^{(4)}, \ldots, X^{(2l)}$ of $X$ are matched to the even subsequences $Y^{(2)}, Y^{(4)}, \ldots, Y^{(2l)}$ of $Y$. For the term to be non-zero, each of these subsequences must consist of a single letter, since otherwise $k_s(X^{(2i)}, Y^{(2i)}) = 0$. Between each of the match positions in $X$ and $Y$ there may be insertions, corresponding to the odd subsequences $X^{(1)}, X^{(3)}, \ldots, X^{(2l+1)}$ and $Y^{(1)}, Y^{(3)}, \ldots, Y^{(2l+1)}$. These subsequences can be of any length, including zero. The example alignment between $X = ATGC$ and $Y = ACC$ we saw earlier in this section corresponds to $X^{(1)} = \emptyset$, $X^{(2)} = A$, $X^{(3)} = \emptyset$, $X^{(4)} = T$, $X^{(5)} = G$, $X^{(6)} = C$, $X^{(7)} = \emptyset$, and $Y^{(1)} = \emptyset$, $Y^{(2)} = A$, $Y^{(3)} = \emptyset$, $Y^{(4)} = C$, $Y^{(5)} = \emptyset$, $Y^{(6)} = C$, $Y^{(7)} = \emptyset$. The kernel sums over all possible alignments: the outermost sum in Equation 6 is over the number of matches, $l \in \{0, 1, 2, \ldots\}$, and the inner sums are over all possible choices of match positions in $X$ and in $Y$. Each alignment is then scored according to $k_s$ and $k_I$.

How flexible is the alignment kernel? Haussler (1999) showed it is strictly positive definite. Jorgensen and Tian (2015), motivated by problems outside biological sequence analysis, showed that the "binomial RKHS" does not have discrete masses; this corresponds to the alignment kernel with $\mu = \tilde{\mu} = 0$, $|B| = 1$, and $k_s(b, b) = 1$. We now show that the alignment kernel can have discrete masses if and only if $\mu$, $\tilde{\mu}$ and $k_s$ satisfy certain constraints.

Theorem 21 (Alignment kernels can have discrete masses) Define $K$ as the matrix with $K_{b,b'} = k_s(b, b')$ for $b, b' \in B$, and $\sigma = 1_B^T K^{-1} 1_B$ where $1_B \in R^B$ is the vector with 1 in each entry. If $\tilde{\mu} > 0$, then $k$ has discrete masses if and only if $2\mu \geq \log \sigma$. If $\tilde{\mu} = 0$, then $k$ has discrete masses if and only if $2\mu > \log \sigma$.

The proof is in Appendix C.
Practically, Theorem 21 shows that if we set the hyperparameters of the alignment kernel naively, it may become unreliable, while if we set them appropriately, it has guaranteed flexibility. In particular, if we want flexibility, our choice of gap penalties ($\mu$ and $\bar\mu$) is constrained by our choice of substitution penalties ($\bar k_s$) through $\sigma$. If there is no penalty for starting a gap ($\bar\mu = 0$) we require the penalty for extending a gap ($\mu$) to be larger than $\frac12\log\sigma$ in order for the kernel to have discrete masses. If there is a penalty for starting a gap ($\bar\mu > 0$), the penalty for extending a gap can be slightly smaller, in that $\mu = \frac12\log\sigma$ also gives discrete masses. In Appendix D we make sense of these conditions as arising from the requirement that the position-wise comparison kernel applied to individual alignments has discrete masses. In Appendix F we investigate relaxations of these conditions, such that the alignment kernel no longer has discrete masses but is still universal; we find that such relaxations are only valid given additional, problematic modifications of the kernel.

7.2 Heavy-tailed alignment kernels

Practically, when applied to data sets with highly diverse sequences, regression methods that use the alignment kernel often exhibit poor generalization due to diagonal dominance (Haussler, 1999; Saigo et al., 2004; Weston et al., 2003). Previous efforts to address this problem have encountered substantial difficulties; practitioners have gone so far as to introduce alternative kernels that are not positive semi-definite (Saigo et al., 2004). In this section, we diagnose possible sources of diagonal dominance in the alignment kernel, and propose heavy-tailed modifications of the alignment kernel that still possess discrete masses.

The key observation is that the alignment kernel, in effect, applies a position-wise comparison kernel to each pairwise alignment between sequences; diagonal dominance stems from the fact that the tail of this position-wise comparison kernel is thin. More precisely, consider an alignment kernel (Definition 20) with a letter kernel $\bar k_s(b, b') = \exp(-\lambda\mathbf 1(b \neq b'))$; recall this choice gave us the exponential Hamming kernel in Section 6. Each term in the sum defining the alignment kernel (Equation 6) corresponds to an individual alignment, and contains a factor corresponding to the exponential Hamming kernel applied to the matched positions, namely $\prod_{i=1}^{l} k_s(X^{(2i)}, Y^{(2i)})$. As the Hamming distance between the matched positions increases, the term decays very quickly (exponentially), giving rise to diagonal dominance. To fix this problem, we follow the same strategy as in Example 5, and integrate over the bandwidth parameter $\lambda$ to derive a thick-tailed alternative.
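As a quick numerical sanity check on this construction (an illustration written for this exposition; the constants $C$, $\beta$ and distance $d$ below are arbitrary, and the snippet assumes scipy is available), averaging the exponential match score $e^{-\lambda d}$ over a Gamma base measure in $\lambda$ produces exactly the power law $(C + d)^{-\beta}$ that appears in Example 9 below.

```python
import math
from scipy.integrate import quad

# Verify: Gamma(beta)^{-1} * int_0^inf lambda^(beta-1) exp(-C lambda)
#         * exp(-lambda d) d(lambda)  ==  (C + d)^(-beta).
C, beta, d = 1.5, 2.0, 3

integral, _ = quad(
    lambda lam: lam ** (beta - 1) * math.exp(-C * lam) * math.exp(-lam * d),
    0, math.inf)
print(integral / math.gamma(beta))  # the mixed exponential score
print((C + d) ** (-beta))           # the power law; the two values agree
```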
Example 9 (Heavy-tailed alignment kernel on matches) The thick-tailed alignment kernel on matches is the kernel $k(X, Y) = \Gamma(\beta)^{-1}\int_0^\infty d\lambda\,\lambda^{\beta-1}e^{-C\lambda}\,k_\lambda(X, Y)$, where $k_\lambda$ is the alignment kernel using $\bar k_s$ with parameter $\lambda$, and $C, \beta > 0$.

From Equation 5, we find that $k(X, Y)$ is,
\[
\sum_{l=0}^{\infty}\ \sum_{\substack{X^{(1)}+\cdots+X^{(2l+1)}=X \\ Y^{(1)}+\cdots+Y^{(2l+1)}=Y}} \left(C + d_H\!\left(X^{(2)}+\cdots+X^{(2l)},\, Y^{(2)}+\cdots+Y^{(2l)}\right)\right)^{-\beta} k_I(X^{(1)}, Y^{(1)}) \prod_{i=1}^{l} k_I(X^{(2i+1)}, Y^{(2i+1)}),
\]
where, as before, only decompositions whose even subsequences are single letters contribute. In this kernel, the dependence on the Hamming distance between matched positions follows a power law, rather than an exponential decay. If we define $K_{\lambda, b, b'} = e^{-\lambda\mathbf 1(b\neq b')}$ then $\sigma_\lambda = 1_B^T K_\lambda^{-1} 1_B = \left(\frac{1}{|B|} + \frac{|B|-1}{|B|}e^{-\lambda}\right)^{-1}$, which approaches 1 as $\lambda \to 0$. Thus, as long as $\mu > 0$, $k_\lambda$ will have discrete masses for small $\lambda$, by Theorem 21. So, by Proposition 12, $k$ has discrete masses. Note that this modified kernel can, like the standard alignment kernel, be computed efficiently using a dynamic programming algorithm (Appendix I) in $O(L^3)$ time if $L = |X| = |Y|$. However, the algorithm can potentially be made more efficient with minor modifications and the use of GPUs (Rush, 2020).

Besides the substitution penalty $\lambda$, we can also consider the role of the insertion penalties $\mu$ and $\bar\mu$. We find a similar issue: as the length of insertions increases, each term of the alignment kernel decays exponentially, since $k_I(X, Y) = \exp\left(-\mathbf 1(|X| > 0)(\bar\mu + \mu|X|) - \mathbf 1(|Y| > 0)(\bar\mu + \mu|Y|)\right)$. Now, the parameter $\mu$ plays the role of the bandwidth. To produce thick tails, we can perform the analogous transform, and take the integral $\Gamma(\beta)^{-1}\int_0^\infty d\mu\,\mu^{\beta-1}e^{-C\mu}\,k_\mu(X, Y)$, where $k_\mu$ is the alignment kernel with parameter $\mu$. This gives us a heavy-tailed alignment kernel on gaps, which has discrete masses by Proposition 12. We could even apply the same transform to the thick-tailed alignment kernel on matches, yielding a thick-tailed alignment kernel on both gaps and matches.

7.3 Local alignments

Genomes are very long, and so in practice machine learning is typically done only on specific genetic elements, such as genes, promoters, enhancers, etc. It is sometimes ambiguous where exactly these genetic elements start or end. This motivates the idea of a local alignment, which ignores insertions and deletions at the start and end of a sequence. Using a local alignment as a measure of similarity means two sequences that contain the same genetic element but different flanking regions will still be considered similar (Durbin et al., 1998, chap. 2). In this section, we study a local alignment kernel, and establish conditions under which it has discrete masses (Vert et al., 2004; Saigo et al., 2004). The local alignment kernel is a modification of the alignment kernel, which does not penalize the creation of gaps at the start or end of pairwise alignments.

Definition 22 (Local alignment kernel) Let $k_I$ and $k_s$ be as in Definition 20. Also define the modified insertion gap penalty kernel $\tilde k_I(X, Y) = \exp(-\mu(|X| + |Y|))$, which lacks the gap start penalty $\bar\mu$. The local alignment kernel is,
\[
k_{la} = \tilde k_I + \sum_{l=1}^{\infty} \tilde k_I * (k_s * k_I)^{*(l-1)} * k_s * \tilde k_I.
\]

We can extend the logic of the proof of Theorem 21 to establish conditions under which local alignment kernels have discrete masses; these conditions turn out to be identical to those for the regular alignment kernel.

Theorem 23 Define $K$ as the matrix with $K_{b,b'} = \bar k_s(b, b')$ for $b, b' \in B$, and $\sigma = 1_B^T K^{-1} 1_B$, where $1_B \in \mathbb R^B$ is the vector with 1 in each entry. If $\bar\mu = 0$, $k_{la}$ has discrete masses if and only if $2\mu > \log\sigma$. If $\infty > \bar\mu > 0$, $k_{la}$ has discrete masses if and only if $2\mu \geq \log\sigma$. If $\bar\mu = \infty$, $k_{la}$ has discrete masses regardless of the values of $\mu, \sigma$.

A proof is given in Appendix H. Note that for the standard, non-local alignment kernel, if we set the gap start penalty to infinity ($\bar\mu = \infty$), we recover a position-wise comparison kernel.
For local alignments, however, the situation is more complex, as there are different options for the number of gaps to put at the start and end of each sequence. We find that when $\bar\mu = \infty$, the local alignment kernel has discrete masses regardless of the value of its other hyperparameters.

8. Kmer spectrum kernels

In this section we study kmer spectrum kernels. Kmer spectrum kernels, like alignment kernels, are motivated by the biological observation that sequences which differ by insertions, deletions, or other complex mutations often share similar phenotypes and are related evolutionarily. Remarkably, we will find that the relationship between kmer spectrum kernels and alignment kernels runs deeper than their motivation: the two are in certain cases equivalent, up to an appropriate tilting.

Kmer spectrum kernels featurize sequences according to the presence and absence of subsequences (kmers). Since the position of the kmer within the sequence does not affect the value of the corresponding feature, kmer spectrum kernels can be relatively insensitive to insertions or deletions, as compared with position-wise comparison kernels. In Example 1, we saw a commonly used kmer spectrum kernel which had only a finite number of features, and so was neither universal nor characteristic. Here we will study infinite-dimensional kmer spectrum kernels, and show that they have discrete masses. When the kmer length $k$ is not very small or gaps are included, the most popular algorithms to evaluate these kernels on two sequences of length $L$ use dynamic programming and run in $O(kL^2)$ (Lodhi et al., 2002).

8.1 Gapped kmer spectrum kernels

We start by connecting alignment kernels to kmer spectrum kernels. The basic connection between the two kernels has led to similar dynamic programming methods to compute each (Lodhi et al., 2002) and has been noted in a special case to motivate neural network architectures (Chen et al., 2019). Here we formally describe the connection in terms of an orthonormal basis of the RKHS. We show that, for a particular tilting, the features of the alignment kernel are a weighted sum of kmer counts, allowing for some kmer mismatches. More precisely, we show that the tilted alignment kernel is equivalent to an infinite-dimensional gapped kmer spectrum kernel. Since tilting preserves discrete masses (Proposition 16), this gapped kmer spectrum kernel has discrete masses.

Intuitively, a gapped kmer is a kmer that includes gaps; it matches another sequence if each of the non-gap characters match, regardless of the letters at the gap positions. For example, the gapped kmer AT-G matches ATCG, ATGG, ATAG and ATTG. Formally, a gapped kmer of length $L$, with $M$ non-gap letters, is defined by an ordered set of indices $J = (j_{-1}, j_0, \ldots, j_M)$, with $-1 = j_{-1} < j_0 < j_1 < \cdots < j_{M-1} < j_M = L$, and a sequence of $M$ non-gap letters, $Z \in B^M$. Each $j_m$ for $m \in \{0, \ldots, M-1\}$ indexes the location of the non-gap letter $Z_{(m)}$. There is a gap after the $m$th letter if $j_m + 1 \neq j_{m+1}$. Let $G(L, M, g)$ be the set of all gapped kmer indices $J$, with length $L$, $M$ non-gap letters, and $g$ gaps. Let $G(L, g) = \cup_M G(L, M, g)$ be the union over all possible $M$. For any sequence $X \in \mathcal S$, and $J = (j_{-1}, j_0, \ldots, j_M) \in G(|X|, g)$, define $X_{(J)} = X_{(j_0)} + \cdots + X_{(j_{M-1})}$.

The following theorem equates alignment kernels to gapped kmer spectrum kernels. In particular, it shows that we can represent alignment kernels in terms of features $u_V(X)$ that depend on the number of times a sequence $V \in \mathcal S$ appears in $X$, allowing for gaps. In other words, each feature $u_V(X)$ depends on a sum over the counts of all gapped kmers with non-gap letters $V$.
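Before stating the theorem, a small illustration (written for this exposition; the convention of counting only internal jumps as gaps, with the sentinels $j_{-1} = -1$ and $j_M = L$ fixed, is one reading of the definition above):

```python
from itertools import product

# Gapped kmer matching: the gapped kmer with letters Z = "ATG" at positions
# J = (0, 1, 3) of a length-4 window (i.e. the pattern AT-G) matches any
# length-4 sequence agreeing with Z at the non-gap positions, whatever the
# letter at the gap position.

def matches(X: str, J: tuple, Z: str) -> bool:
    """Does the gapped kmer (J, Z) match the window X exactly?"""
    return all(X[j] == z for j, z in zip(J, Z))

B = "ATGC"
J, Z = (0, 1, 3), "ATG"
print([X for X in map("".join, product(B, repeat=4)) if matches(X, J, Z)])
# ['ATAG', 'ATTG', 'ATGG', 'ATCG'] -- the four matches named in the text
```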
Theorem 24 (The features of alignment kernels are gapped kmer counts) Consider an alignment kernel $k$ with $\bar k_s(b, b') = \sigma^{-1}|B|\delta_{b'}(b)$ for $b, b' \in B$. Tilting the kernel by $A(X) = e^{\mu|X|}$ gives $k^A(X, Y) = A(X)A(Y)k(X, Y)$. For all sequences $X \in \mathcal S$, define the features $\{u_V(X)\}_{V\in\mathcal S}$, given by
\[
u_V(X) = e^{\frac12\zeta|V|}\sum_{g=0}^{\infty} e^{-\bar\mu g}\sum_{J \in G(|X|, g)} \mathbf 1(X_{(J)} = V), \tag{7}
\]
where $\zeta = 2\mu - \log\sigma + \log|B|$. Now, $\{u_V(X)\}_{V\in\mathcal S}$ is an orthonormal basis for $\mathcal H_{k^A}$.

We call the kernel $k^A(X, Y)$ the "gapped kmer spectrum kernel". The proof is in Appendix E, including the case of general $\bar k_s$. This theorem shows that the features of the tilted alignment kernel depend on a weighted sum over gapped kmer counts. The weights $e^{-\bar\mu g}$ decrease as the number of gaps increases, thus penalizing gapped kmers with large numbers of gaps. Since tilting preserves discrete masses, we can conclude that the gapped kmer spectrum kernel has discrete masses if and only if the conditions of Theorem 21 are satisfied.

The tilting in Theorem 24 down-weights longer sequences, but does not change how the alignment kernel judges sequence similarity. Indeed, the theorem is straightforward to extend to other tiltings, such as the popular normalizing tilting $A(X) = \sqrt{k(X, X)}^{-1}$. In this case we find $k^A(X, Y) = \frac{u(X)^T u(Y)}{\|u(X)\|\,\|u(Y)\|}$, where $u(X)$ is the infinite-dimensional vector indexed by $\mathcal S$ with $u_V(X)$ at position $V \in \mathcal S$. The theorem can also be extended to non-diagonal $\bar k_s$, in which case the features are the gapped kmer counts with a reparameterized alphabet (Appendix E).

Theorem 24 provides a practical example of a kmer spectrum kernel with discrete masses. It is also useful as a theoretical tool. Theorem 39 (Appendix G) uses Theorem 24 to establish that a simple version of the alignment kernel has discrete masses, which in turn allows us to prove that the full alignment kernel has discrete masses (Theorem 21). The proof technique used in Theorem 39 relies on a careful analysis of the combinatorics of biological sequence alignments, using the technique of generating functions and the theory of the Riordan group (Shapiro et al., 1991).

8.2 Heavy-tailed gapped kmer spectrum kernels

We now re-examine the diagonal dominance problem in alignment kernels, in light of Theorem 24. Equation 7 says that the feature weights, namely $\exp(\frac12\zeta|V|)$, increase exponentially with kmer length $|V|$. This exponential scaling means the kernel $k(X, Y)$ takes a very large value when applied to very similar sequences $X$ and $Y$, as compared to its value when $X$ and $Y$ are very different and only have short kmers in common. This can result in diagonal dominance. Trying to avoid the problem by making $\zeta$ small leads to loss of flexibility, as the kernel no longer has discrete masses when $\zeta < \log|B|$. In other words, $\zeta$ acts as a tuning parameter that trades off kernel flexibility and diagonal dominance. To address the problem, we propose a gapped kmer kernel with discrete masses in which the weighting on features scales as a power law with respect to kmer length, rather than exponentially.
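The following sketch (an illustration for this exposition, using the internal-gap counting convention from the snippet in Section 8.1; function names are ours) computes the feature map of Equation 7 for a short sequence by brute-force enumeration, making the exponential growth of the weights $e^{\frac12\zeta|V|}$ with kmer length directly visible.

```python
import math
from collections import defaultdict
from itertools import combinations

# Gapped kmer spectrum features of Theorem 24: for each gapped kmer value V,
# accumulate the gap-penalized count sum_J exp(-mu_bar * g(J)) 1(X_(J) = V),
# then scale by exp(zeta |V| / 2).

def gapped_spectrum_features(X: str, zeta: float, mu_bar: float) -> dict:
    feats = defaultdict(float)
    for M in range(1, len(X) + 1):
        for J in combinations(range(len(X)), M):
            g = sum(1 for a, b in zip(J, J[1:]) if b != a + 1)  # internal gaps
            V = "".join(X[j] for j in J)
            feats[V] += math.exp(-mu_bar * g)
    return {V: math.exp(0.5 * zeta * len(V)) * c for V, c in feats.items()}

u = gapped_spectrum_features("ATGC", zeta=math.log(4), mu_bar=1.0)
print(u["AG"])    # a gapped kmer: weighted count of A..G with one gap
print(u["ATGC"])  # the full-length kmer carries the largest weight
```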
Example 10 (Heavy-tailed gapped kmer spectrum kernel) Call the unscaled features $\tilde u_V(X) = e^{-|V|\zeta/2}u_V(X)$ and notice they do not depend on $\zeta$. Consider a gapped kmer spectrum kernel tilted by $A(X) = e^{-\frac12\zeta|X|}$,
\[
k_\zeta(X, Y) = e^{-\frac{\zeta}{2}(|X|+|Y|)}\sum_{V\in\mathcal S} u_V(X)u_V(Y) = \sum_{V\in\mathcal S} e^{\zeta\left(-\frac12(|X|+|Y|)+|V|\right)}\,\tilde u_V(X)\tilde u_V(Y).
\]
Then for $C, \beta > 0$, the heavy-tailed gapped kmer spectrum kernel is defined as,
\[
k(X, Y) = \Gamma(\beta)^{-1}\int_0^\infty k_\zeta(X, Y)\,\zeta^{\beta-1}e^{-C\zeta}\,d\zeta = \sum_{V\in\mathcal S}\left(C + \tfrac12(|X|+|Y|) - |V|\right)^{-\beta}\tilde u_V(X)\tilde u_V(Y).
\]
In this kernel, shorter features $V$ are down-weighted following a power law, rather than exponential decay. Proposition 12, Theorem 21 and Theorem 24 together ensure the kernel has discrete masses. The kernel can be computed efficiently using the dynamic programming algorithm given in Appendix I in $O(L^3)$ time if $L = |X| = |Y|$.

Note that since $\zeta$ depends on $\mu$, this kernel is closely related to the heavy-tailed alignment kernel on gaps discussed in Section 7.2, which integrates over $\mu$ with the same base measure. The exponential decay in pairwise alignment scores with increasing gap length, and the exponential decay in feature weight for shorter kmers, are two ways of looking at the same diagonal dominance problem in alignment kernels. An analogous construction can also be used to create heavy-tailed local alignment kernels.

8.3 Ungapped kmer spectrum kernels

We now consider ungapped kmer spectrum kernels, whose features correspond to the number of times a kmer appears exactly in a sequence. In particular, we show that a local alignment kernel with an infinite gap start penalty, $\bar\mu = \infty$, is an ungapped kmer spectrum kernel, and that this kmer spectrum kernel therefore has discrete masses.

Proposition 25 (Infinite kmer spectrum kernel) Consider a tilted local alignment kernel $k = k_{la}^A$, with $\bar\mu = \infty$, $\zeta = 0$, and $A(X) = \exp(\mu|X|)$. This kernel has discrete masses and can be written as
\[
k(X, Y) = \sum_{V\in\mathcal S} u_{la,V}(X)\, u_{la,V}(Y),
\]
where the feature $u_{la,V}(X) = \sum_{l=0}^{|X|-|V|}\mathbf 1(X_{(l:l+|V|)} = V)$ is the number of times the kmer $V$ occurs in $X$.

Proof Define $G_{la}(L, M, g)$ as those elements of $G(L) = \cup_g G(L, g)$ that have $M$ non-gap letters and $g$ gaps, not counting gaps at the beginning or end. Using the same logic as Theorem 24, the features of the kernel, after tilting, are, for $V, X \in \mathcal S$,
\[
u_{la,V}(X) = \sum_{g=0}^{\infty} e^{-\bar\mu g}\sum_{J\in G_{la}(|X|, g)}\mathbf 1(X_{(J)} = V).
\]
Now if we set $\bar\mu = \infty$, these features are simply the number of occurrences, without gaps, of $V$ in $X$. By Theorem 23 and Proposition 16, the kernel has discrete masses.

The standard kmer spectrum kernel, by comparison, is $k(X, Y) = \sum_{|V|\leq L} u_{la,V}(X)u_{la,V}(Y)$ (Example 1). That is, it uses a finite rather than an infinite sum over kmers, and so does not possess any flexibility guarantees. The infinite kmer spectrum kernel has strong flexibility guarantees, without sacrificing much computational efficiency: we can compute it using the same dynamic programming methods used for local alignment kernels, since Proposition 25 shows it is a local alignment kernel (Saigo et al., 2004). Note also that since the infinite spectrum kernel is an alignment kernel, it can be evaluated for two sequences of length $L$ with compute $O(L^2)$, faster than the standard kmer spectrum kernel.
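A direct way to see the $O(L^2)$ cost is the following sketch (written for this exposition, with unit feature weights and the sum restricted to nonempty kmers): the infinite spectrum kernel reduces to counting pairs of identical substrings of $X$ and $Y$, which a longest-common-suffix table accumulates in quadratic time.

```python
# Infinite kmer spectrum kernel of Proposition 25 (unit weights): the value
# sum_V count_X(V) * count_Y(V) over nonempty kmers V equals the number of
# pairs of identical substrings of X and Y.

def infinite_spectrum_kernel(X: str, Y: str) -> float:
    # curr[j] holds the length of the longest common substring of X and Y
    # ending at X[i] and Y[j]; that length is exactly the number of matching
    # kmer pairs ending at (i, j), so summing it over all (i, j) gives the
    # kernel value, in O(|X||Y|) time.
    total, prev = 0, [0] * len(Y)
    for i in range(len(X)):
        curr = [0] * len(Y)
        for j in range(len(Y)):
            if X[i] == Y[j]:
                curr[j] = (prev[j - 1] if j > 0 else 0) + 1
                total += curr[j]
        prev = curr
    return float(total)

print(infinite_spectrum_kernel("ATGC", "ACC"))  # 3.0: one shared A, two shared Cs
```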
9. Embedding kernels

In this section we study embedding kernels, which use sequence representations in Euclidean space to judge sequence similarity. A key motivation for such kernels is to avoid the need for hand-crafted similarity measures, and to instead learn a similarity measure from data. The data used to learn a representation need not be the same data the kernel is applied to; for example, a common transfer learning technique is to use a large, unlabeled sequence data set to learn a sequence representation, and then, on a small labeled data set, learn a regression from representations to labels (Yang et al., 2018; Alley et al., 2019; Biswas et al., 2021; Rao et al., 2019; Detlefsen et al., 2022). Popular modern unsupervised representation learning methods include variational autoencoders, recurrent neural networks, and transformers. Such techniques have often been found to produce low-dimensional representations that reflect biological properties well, in the sense that sequences with similar representations have similar phenotypes. An embedding kernel consists of a continuous kernel applied to sequence representations.

Definition 26 (Embedding kernel) Let $F: \mathcal S \to \mathbb R^D$ be an embedding function, which maps sequences to representations. The embedding kernel is defined as $k(X, Y) = k_E(F(X), F(Y))$, where $k_E$ is a kernel on $\mathbb R^D \times \mathbb R^D$.

For example, if we use a variational autoencoder to learn a representation, and a Gaussian process to predict labels, $F$ would be given by the encoder and $k_E$ would be the kernel in the Gaussian process. Deep kernel learning also fits the form of an embedding kernel, with $F$ a neural network (Wilson et al., 2016). If we regress directly from sequence to labels using a neural network, we can think of the map from sequence space to the final hidden layer of the network as the embedding function $F$, while the map from the hidden layer to the output is approximately a kernel regression, by the argument that single layer neural networks converge to Gaussian processes in the infinite width limit (Neal, 1996; Williams, 1996).

9.1 Universality and characteristicness

We first describe conditions under which embedding kernels are universal and characteristic. Intuitively, if we choose a kernel $k_E$ in Euclidean space that is universal and characteristic, then the embedding kernel over sequence space will be universal and characteristic as well, so long as the embedding function $F$ maps each sequence in $\mathcal S$ to a unique point in $\mathbb R^D$.

Proposition 27 (Universal and characteristic embedding kernels) Assume $k_E(z, z') = \Psi(z - z')$ is a translation-invariant kernel and that $\Psi$ is a positive continuous function on $\mathbb R^D$ that has a strictly positive Fourier transform.⁴ Then the embedding kernel $k$ is characteristic and $C_0$-universal if and only if $F$ is injective.

The proof is in Appendix J. Note that the conditions on $k_E$ ensure that it is universal and characteristic over $\mathbb R^D$ (Sriperumbudur et al., 2010, 2011). The condition that $F$ is injective conflicts with some representation learning approaches, particularly those aimed at learning sparse, interpretable features. For the sake of interpretability, it is often desirable to learn a representation that ignores certain features of the data points entirely, such as those features which vary little in the training data. However, when this representation is applied to make predictions about held-out data, those sequences that do vary in the ignored features can be mapped to the same point in the embedding space. Since the embedding $F$ is therefore not injective, downstream supervised predictors will be limited in their flexibility. An example is in Appendix K.
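A minimal sketch of Definition 26 (written for this exposition; the hash-based stand-in for $F$ is a hypothetical placeholder for a learned encoder, and is only deterministic within a single Python process since string hashing is salted across runs):

```python
import numpy as np

# An embedding kernel composes a kernel on R^D with an embedding function F.
# Here k_E is the Gaussian kernel exp(-||z - z'||^2), whose Fourier transform
# is strictly positive, as required by Proposition 27.

D = 16

def F(X: str) -> np.ndarray:
    # Stand-in embedding: a pseudo-random vector seeded by the sequence.
    rng = np.random.default_rng(abs(hash(X)) % (2 ** 32))
    return rng.standard_normal(D)

def embedding_kernel(X: str, Y: str) -> float:
    z, zp = F(X), F(Y)
    return float(np.exp(-np.sum((z - zp) ** 2)))

print(embedding_kernel("ATGC", "ACC"))
```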
By contrast to such interpretable representations, a completely uninterpretable representation, which makes no attempt to organize embedding space by a biological notion of sequence similarity, can easily yield universal and characteristic embedding kernels. In particular, if we use an embedding function that gives sequences random representations, distributed according to a continuous distribution on $\mathbb R^D$, then the chance of any two sequences having the same embedding is zero.

Proposition 28 (Random embeddings give universal and characteristic kernels) Assume $\{F(X)\}_{X\in\mathcal S}$ are i.i.d. random variables distributed according to a measure which is absolutely continuous with respect to the Lebesgue measure. Then $F$ is almost surely injective, and so an embedding kernel based on $F$ is almost surely universal and characteristic.

Proof Number the elements of $\mathcal S$, such that $\{X_1, X_2, \ldots\} = \mathcal S$. For any $n \in \mathbb N$, we have that $F(\{X_1, \ldots, X_n\})$ is a set of measure zero, so $p(F(X_{n+1}) = F(X_i)\text{ for some }i \leq n) = 0$. Thus, $p(F(X_i) = F(X_j)\text{ for some }i \neq j) = 0$.

One implication of this proposition is that the meaningless randomness introduced through stochastic training of large neural networks may help them produce representations that are useful for downstream tasks.

9.2 Discrete masses

We have so far described conditions under which embedding kernels are characteristic and universal; we found that random embeddings suffice. However, we saw in Example 3 an embedding kernel that is universal and characteristic but cannot reliably be used for optimization, as it does not metrize the space of distributions. In this section, we address the question of when embedding kernels have discrete masses, and thus metrize $\mathcal P(\mathcal S)$. We find that many standard representation learning techniques are likely to produce embedding kernels that lack discrete masses even when they are universal and characteristic, and propose a fix.

4. $\Psi$ has a strictly positive Fourier transform if $\hat\Psi(\xi) > 0$ for all $\xi$, where $\hat\Psi$ is the Fourier transform of $\Psi$. For example, $\Psi(x) = \exp(-\|x\|^2)$ has a strictly positive Fourier transform.

For an embedding kernel to have discrete masses, the embedding function $F$ must not only map different sequences to different points in $\mathbb R^D$; those points must also be sufficiently spread out.

Proposition 29 (Embedding kernels with discrete masses) Consider an embedding kernel $k$ with an injective $F$ that meets the conditions of Proposition 27. $k$ metrizes $\mathcal P(\mathcal S)$ if and only if $k$ has discrete masses, which occurs if and only if $F(\mathcal S)$ has no accumulation points, that is, there is no $X \in \mathcal S$ such that $F(X)$ is in the closure of $F(\mathcal S \setminus \{X\})$.

The proof is given in Appendix J. In Example 2, we saw an embedding kernel that was universal and characteristic, but which had an accumulation point at the sequence A, and thus did not metrize $\mathcal P(\mathcal S)$. Unlike for universality and characteristicness, to construct an embedding kernel with discrete masses, it is not sufficient to use sequence representations that are drawn i.i.d. from a distribution on $\mathbb R^D$.

Proposition 30 (I.i.d. embeddings do not have discrete masses) Assume $\{F(X)\}_{X\in\mathcal S}$ are i.i.d. random variables. Then the embedding function has accumulation points almost surely, and so does not metrize $\mathcal P(\mathcal S)$.

Proof Consider any $X_0 \in \mathcal S$, and note that almost surely, $p(\|F(X) - F(X_0)\| < \epsilon) > 0$ for any $X, \epsilon$. Thus, $p(\|F(X) - F(X_0)\| < \epsilon\text{ for some }X) = 1$ for all $\epsilon$. So, with probability 1 there are sequences $X_{n_1}, X_{n_2}, \ldots$ which are not $X_0$ but where $F(X_{n_k}) \to F(X_0)$ as $k \to \infty$.
Intuitively, to construct an embedding without accumulation points, one must somehow fit sequence space $\mathcal S$ into $\mathbb R^D$ without letting the representations of each sequence get too close together. One approach is to make embeddings of longer and longer sequences more and more spread out, rather than sampling all the embeddings from the same distribution.

Proposition 31 (Scaled random embeddings have discrete masses) Consider an initial embedding $\bar F$ where each $\bar F(X)$ for $X \in \mathcal S$ is drawn from the uniform distribution on the unit ball, $\{x \in \mathbb R^D \mid \|x\| \leq 1\}$. Then, a kernel using the scaled embedding $F(X) = |B|^{(1+\epsilon)|X|/D}\bar F(X)$, for $\epsilon > 0$, metrizes $\mathcal P(\mathcal S)$ almost surely.

The proof can be found in Appendix L. Intuitively, the idea is that this rescaling gives each sequence representation enough "space". If $V(B(1))$ is the volume of the unit ball in $\mathbb R^D$, we can think of $F$ as embedding the set $B^L$ into a ball of radius $|B|^{(1+\epsilon)L/D}$. Thus, the average amount of volume per sequence is $V(B(1))(|B|^{(1+\epsilon)L/D})^D/|B|^L = V(B(1))|B|^{L\epsilon}$, which is growing with $L$. This is just enough space to ensure there is no accumulation.

Practically, Proposition 30 suggests that many representation learning approaches are unlikely to yield kernels that metrize $\mathcal P(\mathcal S)$. Typically, as we embed more and more sequences, their representations stay clustered in a localized region of embedding space near the origin, rather than diverging away. Indeed, in many applications of representation learning, this behavior is desirable, as it makes the representations easier to visualize and interpret. One way to understand this behavior is as a consequence of the priors and regularizers typically used in representation learning. For instance, in latent variable models, we expect the set of embeddings $\{F(X)\}_{X\in\mathcal S}$, taken as a whole, to look roughly like a set of samples from the prior on the latent variable; since the prior is typically chosen to be a continuous distribution on $\mathbb R^D$, such as a normal distribution, Proposition 30 suggests the embedding kernel will not have discrete masses. The fact that we only observe a finite number of data points in the training data can also hurt: even if the representations of the training data are well spread out, the representations of sequences far from the training data are likely to look roughly like samples from the prior. In short, typical representation learning procedures are unlikely to give sequences enough space to produce embedding kernels with discrete masses.

Proposition 31 suggests a simple fix: rescale the embedding. Given an embedding $\bar F$ that does not become more spread out with increasing sequence length, modify it to $F(X) = |B|^{(1+\epsilon)|X|/D}\bar F(X)$, for $\epsilon > 0$. The resulting kernel will, by the reasoning of Proposition 31, likely have discrete masses. In Section 10.3 we illustrate the effectiveness of this trick in practice. Another approach would be to use a length-dependent prior, or length-dependent regularization, when learning the embedding in the first place.
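The rescaling is a one-line modification in practice. A minimal sketch under the assumptions of Proposition 31 (illustrative code written for this exposition; caching the base embedding makes it a well-defined function of the sequence):

```python
import numpy as np

# Proposition 31: draw a base embedding uniformly from the unit ball, then
# scale by |B|^((1+eps)|X|/D) so longer sequences are pushed farther from
# the origin and the image has no accumulation points (almost surely).

D, eps, alphabet_size = 16, 0.1, 4
rng = np.random.default_rng(0)
_cache = {}

def F_bar(X: str) -> np.ndarray:
    if X not in _cache:
        z = rng.standard_normal(D)
        z /= np.linalg.norm(z)                       # uniform direction
        _cache[X] = z * rng.uniform() ** (1.0 / D)   # uniform radius in the ball
    return _cache[X]

def F_scaled(X: str) -> np.ndarray:
    return alphabet_size ** ((1 + eps) * len(X) / D) * F_bar(X)

for s in ["A", "AT", "ATG"]:
    print(s, np.linalg.norm(F_scaled(s)))  # norms grow with sequence length
```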
10. Empirical Results

In this section we examine the performance, on real biological sequence data sets, of some of our proposed kernels with discrete masses. We compare each to an existing kernel that relies on a similar notion of sequence similarity but lacks discrete masses. We find that our theoretical guarantees on kernel flexibility lead to systematic improvements in performance on finite data. We consider eight kernels across each of the four families we studied, and three machine learning tasks. These include tasks where kernel methods are already standard in biology (regression, distribution comparison) and tasks where our guarantees enable novel methods that are new to biological applications (choosing representative sequences). Code can be found at https://github.com/AlanNawzadAmin/Kernels-with-guarantees/.

10.1 Predicting transcription factor binding

We first consider a regression problem, where the aim is to predict whether a transcription factor binds a DNA sequence. Understanding transcription factor binding is critical for understanding and controlling gene expression. The data consists of pairs of DNA sequences and transcription factor binding strengths, measured in terms of the intensity of a fluorescent signal in a micro-array assay (Barrera et al., 2016). We examined three separate data sets from three distinct transcription factors, the homeobox proteins Hox-D13, Nkx2-5, and Esx1. All the DNA sequences are length 8. We fit the data using kernel regression, and evaluate success using the root mean squared error on the training data. Our aim is to compare kernel flexibility; more flexible kernels should provide better fits to the data.

We first consider alignment kernels, and, applying Theorem 21, compare a kernel that has discrete masses ($2\mu \geq \log\sigma$) to one that does not ($2\mu < \log\sigma$). We find that the version with discrete masses has better performance in the large data regime across all three example data sets (Fig. 2). The version without discrete masses eventually starts to plateau with increasing data, while the version with discrete masses reliably approaches the correct answer in Fig. 2b and 2c.

We next consider kmer spectrum kernels, and again compare a kernel that has discrete masses (the infinite kmer spectrum kernel, Proposition 25) to one that does not (the finite kmer spectrum kernel of Example 1, with $L = 3$).

Figure 2: Kernel regression applied to transcription factor binding. Kernel regression performance, comparing alignment and kmer spectrum kernels with and without discrete masses ("w/ δ" versus "w/o δ"); panel (c) shows Hox-D13. N is the size of the sub-sampled data set. Performance is measured by root mean squared error, normalized by the standard deviation of the outcome variable.

Here again we find that the version with discrete masses has better performance in the large data regime, while the version without can plateau at a substantial error level (Fig. 2). While in the case of the alignment kernels, the version with discrete masses showed moderately worse performance in the low data regime, here the version with discrete masses performed as well or better than the version without across the full range of data set sizes considered. In short, guaranteeing discrete masses leads to more reliable regression, especially but not exclusively in the large data regime.

10.2 Distinguishing immune repertoires

In this section we consider a testing problem, where the aim is to distinguish between patients' T cell receptor (TCR) repertoires. T cells play a central role in the human adaptive immune system, recognizing foreign antigens through their TCR and then triggering an immune response. Here, we compare two different patients' immune systems by comparing the distributions of their TCR sequences. For each patient, we have a data set of TCR CDR3 sequences, which vary in length from 10 to 19 amino acids (10x Genomics, 2022).
We apply a two-sample test by performing a bootstrap for degenerate U-statistics on the MMD, with the null hypothesis that the patients' TCRs are drawn from the same distribution (Arcones and Giné, 1992). We evaluate performance based on how often the test correctly rejects the null hypothesis at 90% confidence, averaging over 25 random sub-samples of a large data set. Our aim is to compare kernel flexibility; more flexible kernels should be able to distinguish between two patients' repertoires more reliably. We apply the same four kernels from Section 10.1. Although all kernels were able to distinguish the two distributions eventually with enough data, the kernels with discrete masses were able to distinguish reliably with much less data (Fig. 3). In other words, guaranteeing discrete masses improves the power of the two-sample test on real data sets.

Figure 3: MMD two-sample test applied to patient TCR repertoires. MMD two-sample test performance, comparing alignment and kmer spectrum kernels with and without discrete masses ("w/ δ" versus "w/o δ"). N is the size of the sub-sampled data set. Performance is measured by the fraction of independent data sub-samples for which the null hypothesis is, correctly, rejected.

Figure 4: Optimizing representative TCR sequences using embedding kernels. Distribution optimization performance, comparing unscaled embedding kernels (likely to not have discrete masses) to scaled embedding kernels (likely to have discrete masses). (a) MMD per step during optimization. Here, the MMD is normalized to the MMD of the initial sequence X. Optimization was repeated with 10 different initial sequences. (b) Length of the optimized sequence during optimization. The maximum, minimum and average value of the target sequence lengths are shown in green. (c) Examples of sequences from the target set and of optimized sequences.

10.3 Optimizing representative receptors

In this section we consider a distribution optimization problem, where the aim is to find a TCR sequence that is representative of a larger set. Often biologists are interested in understanding the properties of many sequences, for instance, the TCRs from all the T cells that infiltrated a patient's tumor, but have limited experimental resources, and so can only synthesize and test a small number in the laboratory. Here, we aim to find a single sequence $X$ that is representative of a large set of TCR sequences, $Y_1, \ldots, Y_N$. We do so by optimizing $\delta_X$ to be close to the empirical distribution $\bar p_Y = \frac1N\sum_{n=1}^N \delta_{Y_n}$, with the optimization objective $\mathrm{MMD}(\delta_X, \bar p_Y)$. We use $N = 100$ human TCR CDR3 sequences, with lengths varying between 10 and 17, as the target set (10x Genomics, 2022). We initialize $X$ at a sequence that is longer than those in the target set, but has similar amino acid composition and arrangement (in particular, a random CDR3 from the target set, concatenated to itself). We then optimize $\mathrm{MMD}(\delta_X, \bar p_Y)$ by taking the best substitution, insertion or deletion of a single amino acid at each step, for 100 steps. Our aim is to compare kernels' ability to metrize $\mathcal P(\mathcal S)$. Since the main difference between the initial $X$ and the target set is its length, we are particularly interested in seeing whether optimizing $\mathrm{MMD}(\delta_X, \bar p_Y)$ yields a sequence $X$ that is similar to the target set in length.
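The greedy single-edit search is simple to sketch (illustrative code for this exposition, not the paper's implementation; the kernel k is a free parameter, e.g. a scaled embedding kernel, and the constant term of the squared MMD is dropped since it does not depend on X):

```python
import numpy as np

# MMD^2 between a point mass at X and the empirical distribution of
# Y_1..Y_N is  k(X, X) - (2/N) sum_n k(X, Y_n) + const;  each step we take
# the single substitution/insertion/deletion that most reduces it.

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

def mmd_objective(X, targets, k):
    return k(X, X) - 2.0 * np.mean([k(X, Y) for Y in targets])

def single_edits(X):
    for i in range(len(X)):
        yield X[:i] + X[i + 1:]                 # deletion at position i
        for a in ALPHABET:
            yield X[:i] + a + X[i + 1:]         # substitution at position i
    for i in range(len(X) + 1):
        for a in ALPHABET:
            yield X[:i] + a + X[i:]             # insertion at position i

def greedy_step(X, targets, k):
    return min(single_edits(X), key=lambda Z: mmd_objective(Z, targets, k))
```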
We focus on embedding kernels. We use a pre-trained embedding function $F: \mathcal S \to \mathbb R^{64}$, UniRep, learned by a deep recurrent neural network trained on a large data set of diverse protein sequences drawn from across evolution (Alley et al., 2019). We compare embedding kernels that use $F$ directly (which are likely to be universal and characteristic by Proposition 27 but are unlikely to have discrete masses, by Proposition 30) to those that use the scaled embedding $\tilde F(X) = 20^{(1+\epsilon)|X|/64}F(X)$ with $\epsilon = 0.1$ (which are likely to have discrete masses, by Proposition 31). For both embedding functions, we construct an embedding kernel using the Euclidean inverse multiquadric kernel, $k_E(x, y) = (1 + \|x - y\|_2^2)^{-1}$, which satisfies the conditions of Proposition 27.

For both embedding kernels, our optimization routine converges in MMD within the first 50 steps (Fig. 4a). The unscaled kernel yields optimized sequences $X$ that are much longer than those in the target set, while the scaled kernel, which is likely to have discrete masses, yields optimized sequences $X$ at the target set's average length (Fig. 4b). Visual inspection of the optimized $X$ also suggests that the scaled kernel produces substantially more representative designs; for instance, the optimized sequences from the scaled kernel have a cysteine in the first position, like all the target sequences, but the optimized sequences from the unscaled kernel do not (Fig. 4c). In other words, rescaling the kernel to add discrete masses improves its ability to metrize $\mathcal P(\mathcal S)$ in practice.

11. Discussion

In this article, we studied the flexibility of kernels for biological sequences. Our aim was to find kernels that are universal, characteristic, and metrize $\mathcal P(\mathcal S)$, properties which guarantee they will be reliable when applied to machine learning problems involving regression, distribution comparison, and distribution optimization. We found that many existing kernels fail to meet one or all of these criteria, but that all three are met if the kernel has discrete masses. We then modified existing kernels to ensure they have discrete masses, imbuing them with strong theoretical guarantees and improved empirical performance.

In applications on Euclidean space, issues of kernel flexibility are now rarely discussed, and the reliability of kernel-based methods like Gaussian processes is often taken for granted. This is for a good reason: the default kernels implemented in packages and recommended in papers and tutorials are typically universal and characteristic at minimum (such as the Gaussian kernel) and often metrize weak convergence as well (such as the Matérn and inverse multiquadric kernels) (Sriperumbudur et al., 2010, 2011; Balandat et al., 2020). We have sought to provide a similarly stable foundation for kernel methods on biological sequence data.

We have considered a variety of different biological notions of sequence similarity, but in some settings, one may want others. Sequences can be joined at the end, as with cyclic peptides and circular RNA, or joined into large protein complexes, made up of many separate polypeptide chains. They can possess symmetries, like repeats, and they can be mutated in complex ways, such as in immune receptor development. One direction for future work is to construct new kernels with discrete masses that are suited to these or other biological phenomena. The tools we have developed for proving kernels have discrete masses (Section 5) provide a starting point.
We may also use these tools to better understand and improve deep learning methods (Alipanahi et al., 2015). On the one hand, kernels may be used as a lens for understanding the behavior of deep learning methods, by analyzing their neural tangent kernels (Jacot et al., 2018). On the other hand, kernels may inspire new neural network architectures with appropriate inductive biases (Lei et al., 2017; Chen et al., 2019). Another direction is to develop methods to automatically learn complex kernels from data, while preserving the discrete mass property. Our proofs in Sections 6-9 demonstrate how, starting with a simple kernel with discrete masses, we can apply a series of discrete mass-preserving transformations (Section 5.2) to develop a more complicated kernel with discrete masses. A possible approach to learning new kernels is to optimize this series of transformations.

Broadly, our work contributes to the small but growing literature on the theoretical foundations of machine learning for biological sequences. Machine learning methods for biological sequences are of wide and increasing importance to human health and the environment, and are often applied to problems where trust and reliability are essential. However, they are sometimes mistakenly thought of as mere special cases of general purpose methods, invented for other types of data. Our results instead encourage further investigation into the specific theory of biological sequence analysis.

Appendix A. Illustrative example

We illustrate the practical use of our theoretical results via a toy example. We consider the problem of sequence regression, where the aim is to predict a sequence's phenotype, e.g. fluorescent color, enzymatic activity, binding strength, etc. We show that kernel regression using a classic biological sequence kernel, a Hamming kernel of lag $L$ (which is an instance of a weighted degree kernel), can fail entirely to fit the true sequence-to-phenotype map. We then introduce a small but non-obvious modification to the kernel, which our theory guarantees will allow kernel regression to succeed.

Hamming kernels compare sequences based on a sliding window, counting the number of times the subsequence (kmer) in the window matches exactly (Schweikert et al., 2008). More precisely, we consider a Hamming kernel that uses features $\Phi_{l,V}(X) = \mathbf 1(X_{(l:l+L)} = V)$, which indicate whether the $L$-mer at position $l$ of $X$ is $V$. The kernel is then $k_H(X, Y) = (\Phi(X)|\Phi(Y)) = \sum_{l}\sum_{V\in B^L}\Phi_{l,V}(X)\Phi_{l,V}(Y)$ for sequences $X, Y$, i.e. it counts the number of $L$-mers found in the same position in $X$ and $Y$. Note that in the case where $L = 1$, the representation $\Phi(X)$ reduces to the extremely popular one-hot-encoding representation of $X$, and the kernel takes the form $k_H(X, Y) = |X| \wedge |Y| - d_H(X, Y)$, where $d_H$ is the Hamming distance.

This kernel, however, might not be able to describe the true relationship between sequences and the phenotype that the scientist is interested in. The reason is that the kernel is not universal, and so has limited flexibility. To see this, we will show that functions $f$ in $\mathcal H_k$ are restricted to only take certain values. In particular, the values that a function $f$ takes on a set of sequences $A_1$ can be constrained by the values it takes on a subset $A_2 \subset A_1$. The problem arises as soon as the length of sequences in the data set exceeds the length of the sliding window in the kernel, $L$. Define $A_1 = \{X\}_{|X|=L+1}$ to be the set of all sequences of length $L + 1$.
Define $A_2 = \{X + X_{(0)}\}_{|X|=L}$ to be the set of those sequences of length $L + 1$ that end with the same letter they started with (namely, $X_{(0)}$). Now, when performing kernel regression, we are fitting a function $f = \sum_{n=1}^N \alpha_n k_H(Y_n, \cdot)$. We find that $f$ must satisfy the constraint that its average value on $A_1$ equals its average value on $A_2$,
\[
\frac{1}{|A_1|}\sum_{X\in A_1} f(X) = \sum_{n=1}^N\alpha_n\left(\Phi(Y_n)\,\Big|\,\frac{1}{|A_1|}\sum_{X\in A_1}\Phi(X)\right) = \sum_{n=1}^N\alpha_n\left(\Phi(Y_n)\,\Big|\,\frac{1}{|A_2|}\sum_{X\in A_2}\Phi(X)\right) = \frac{1}{|A_2|}\sum_{X\in A_2} f(X). \tag{8}
\]
The reason is that for every $L$-mer $V \in B^L$ and starting position $l \in \{0, 1\}$, the proportion of sequences that have $V$ at position $l$ is the same in $A_2$ as it is in $A_1$, namely $|B|^{-L}$. Thus, $\frac{1}{|A_1|}\sum_{X\in A_1}\Phi(X)_{l,V} = |B|^{-L} = \frac{1}{|A_2|}\sum_{X\in A_2}\Phi(X)_{l,V}$ for all $l, V$, yielding Eqn. 8. The implication of Eqn. 8 is that functions in the RKHS are constrained, and cannot describe any possible relationship between sequences and their phenotypes; in other words, it says that the kernel is not universal. Note that this argument also generalizes to more complex versions of the Hamming kernel, such as those that consider all kmers up to length $L$, or those that score kmer similarity based on the biophysical properties of amino acids instead of exact matches (Schweikert et al., 2008; Toussaint et al., 2010).

The limited flexibility of the Hamming kernel can lead to serious practical failures, where the model is not just slightly wrong on some data points but instead makes terrible predictions for all data points. Say a scientist has taken measurements of all sequences of length $L + 1$, obtaining a data set $\{g(X) : X \in A_1\}$. They find that sequences $X \in A_2$ that start and end with the same letter have a score of $g(X) = |B| - 1$, and those that do not (that is, $X \in A_1 \setminus A_2$) have a score of $g(X) = -1$. We assume there is no measurement error, i.e. $g(X)$ is the ground truth value of the phenotype. The least-squares fit for this big, clean data set, using kernel regression with the Hamming kernel, turns out to be awful: $f(X) = 0$ for all $X$. To see this, note that the least squares objective is convex and its gradient vanishes at $\alpha_1, \ldots, \alpha_N = 0$: for every $Y \in A_1$,
\[
\partial_{\alpha_Y}\sum_{X\in A_1}\frac12\left(f(X) - g(X)\right)^2\Big|_{\alpha=0} = -\sum_{X\in A_1} k_H(Y, X)g(X) = -\left(|B|\sum_{X\in A_2} k_H(Y, X) - \sum_{X\in A_1} k_H(Y, X)\right) = -|A_1|\left(\frac{1}{|A_2|}\sum_{X\in A_2} k_H(Y, X) - \frac{1}{|A_1|}\sum_{X\in A_1} k_H(Y, X)\right) = 0,
\]
where the last equality follows from Eqn. 8 applied to $f = k_H(Y, \cdot)$. The lack of kernel flexibility is thus a serious practical concern, as it can lead to terrible model performance even when we have large amounts of clean data.

Our theoretical results in Section 6 will provide a straightforward solution, in the form of a modified kernel: the inverse-multiquadric Hamming (IMQ-H) kernel,
\[
k_{\mathrm{IMQ\text{-}H}}(X, Y) = \frac{1}{\left(1 + |X| \wedge |Y| - (\Phi(X)|\Phi(Y))\right)^2}.
\]
This kernel is just as tractable to compute as the original Hamming kernel, $k_H(X, Y) = (\Phi(X)|\Phi(Y))$. Like the original kernel, it treats sequences that have the same kmers in the same positions as similar to one another. Thus, if there is a biological reason to use a Hamming kernel for a specific problem, the same justification likely holds for an IMQ-H kernel.⁵ However, we will prove that the IMQ-H has a fundamental advantage: it has discrete masses, which means that it is universal. If we perform a regression using the IMQ-H instead of the original Hamming kernel, we are guaranteed to be able to describe any sequence-to-property relationship.

We illustrate the benefits of our IMQ-H kernel in simulation. As the ground truth sequence-to-phenotype map, we consider $g(X) = \max_{b\in B}\sum_{l=0}^{|X|-1}\mathbf 1(X_{(l)} = b)$, which counts the number of occurrences of the most common letter in the sequence.

5. For instance, if we are examining promoters, whose biological activity depends on the arrangement of transcription factor binding sites, we have good biological reasons to believe that sequences with the same kmers in the same positions will have similar function, as transcription factors typically bind conserved sites of roughly 8-10 nucleotides. In this context, it makes sense to use $(\Phi(X)|\Phi(Y))$ as a similarity score, but the biology offers no particular reason to prefer $k_H$ over $k_{\mathrm{IMQ\text{-}H}}$.
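For concreteness, a minimal sketch of the two kernels in the $L = 1$ (one-hot) case (illustrative code for this exposition, not the paper's implementation):

```python
# The Hamming kernel counts positions with equal letters; the IMQ-H kernel
# passes the corresponding mismatch count through an inverse-multiquadric
# profile, which is what grants discrete masses.

def phi_inner(X: str, Y: str) -> int:
    """(Phi(X) | Phi(Y)) for L = 1: number of positions with equal letters."""
    return sum(1 for a, b in zip(X, Y) if a == b)

def k_hamming(X: str, Y: str) -> float:
    return float(phi_inner(X, Y))

def k_imq_h(X: str, Y: str) -> float:
    mismatch = min(len(X), len(Y)) - phi_inner(X, Y)
    return 1.0 / (1.0 + mismatch) ** 2

print(k_hamming("ATGC", "ATGA"), k_imq_h("ATGC", "ATGA"))  # 3.0  0.25
```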
We draw data points $X_1, X_2, \ldots$ from a uniform distribution on DNA sequences of length 4, and record noiseless observations of $g(X_n)$ for each. We estimate $g$ using least-squares kernel regression, and compare two different kernels: the Hamming kernel with $L = 2$ and the IMQ-H kernel with $L = 1$. Naively, one might expect the IMQ-H kernel to perform more poorly, as it uses a smaller sliding window $L$. In fact, however, we find that regression with the Hamming kernel asymptotes at a high error as we collect more and more data, while with the IMQ-H the error converges to zero (Figure 1). Our theoretical guarantees can thus lead to dramatic performance gains at negligible computational cost.

Appendix B. A universal kernel that metrizes the space of distributions but lacks discrete masses

Here we show that having discrete masses is not strictly necessary in order to have a kernel that is universal or metrizes $\mathcal P(\mathcal S)$. Recall Example 2, where the kernel failed to metrize $\mathcal P(\mathcal S)$ because $k_{\mathrm A}$ could be well approximated by $k_{n\cdot\mathrm A}$ in $\mathcal H_k$. To obtain a universal kernel that metrizes $\mathcal P(\mathcal S)$ but lacks discrete masses, we will create a kernel on the integers $\{0, 1, -1, 2, -2, \ldots\}$ such that $k_0$ can be approximated by $\int d\nu_n(X)\,k_X$ for finite signed measures $\nu_n$, but only if the total variation of $\nu_n$ blows up with $n$.

Let $\mathcal H_k$ be a Hilbert space with an orthonormal basis indexed by the integers, $\{e_0, e_1, e_{-1}, \ldots\}$. Define $k_0 = e_0$ and, for any positive $n$, $k_n = e_n$ and $k_{-n} = \sqrt{1 - \epsilon_n^2 - \epsilon_n^4}\,e_n + \epsilon_n^2 e_{-n} + \epsilon_n e_0$ for a decreasing sequence of positive $\epsilon_n$ with $\epsilon_n \to 0$. Clearly $k$ is a strictly positive definite kernel on the integers and $k(z, z) = 1$ for all integers $z$.

$\mathcal H_k$ cannot contain $\delta_0$. To see this, say that $\delta_0 \in \mathcal H_k$. Then
\[
0 = \frac{1}{\epsilon_n}\left(\delta_0(-n) - \delta_0(n)\right) = \left(\delta_0\,\Big|\,\frac{1}{\epsilon_n}(k_{-n} - k_n)\right)_k, \quad\text{while}\quad \frac{1}{\epsilon_n}(k_{-n} - k_n) = \frac{\sqrt{1 - \epsilon_n^2 - \epsilon_n^4} - 1}{\epsilon_n}e_n + \epsilon_n e_{-n} + e_0 \to e_0,
\]
so $0 = (\delta_0|e_0)_k = \delta_0(0) = 1$, a contradiction.

However, $k$ is $C_0$-universal. To see this, say for some finite (possibly signed) measures $\mu, \nu$ on $\mathcal S$, we have $\sum_z \nu(z)k_z = \sum_z \mu(z)k_z$. Call $\pi = \nu - \mu$. Note $0 = (\sum_z \pi(z)k_z|e_{-n})_k = \epsilon_n^2\pi(-n)$ and $0 = (\sum_z \pi(z)k_z|e_n)_k = \pi(n) + \pi(-n)\sqrt{1 - \epsilon_n^2 - \epsilon_n^4}$. Thus $\pi(n) = 0$ for all $n \neq 0$. Finally, $0 = (\sum_z \pi(z)k_z|e_0)_k = \pi(0)$, so $\pi = 0$, i.e. $\mu = \nu$. By Proposition 2 of Sriperumbudur et al. (2011), $k$ is universal.

$k$ also metrizes $\mathcal P(\mathcal S)$. To see this, say for some $\mu, \nu_1, \nu_2, \cdots \in \mathcal P(\mathcal S)$, we have $\sum_z \nu_m(z)k_z \to \sum_z \mu(z)k_z$. Call $\pi_m = \nu_m - \mu$. As above, $(\sum_z \pi_m(z)k_z|e_{-n})_k = \epsilon_n^2\pi_m(-n)$ and $(\sum_z \pi_m(z)k_z|e_n)_k = \pi_m(n) + \pi_m(-n)\sqrt{1 - \epsilon_n^2 - \epsilon_n^4}$. For each $n$, the left-hand sides of both of these equations go to 0 as $m \to \infty$, so $\pi_m(n) \to 0$ for all $n \neq 0$. Finally note
\[
\left(\sum_z \pi_m(z)k_z\,\Big|\,e_0\right)_k = \pi_m(0) + \sum_{n=1}^{\infty}\pi_m(-n)\epsilon_n.
\]
The left-hand side goes to 0 and, since $\sum_z|\pi_m(z)| \leq 2$ and $\epsilon_n \to 0$, the sum on the right-hand side also goes to 0. Thus $\pi_m(0) \to 0$ and so $\nu_m \to \mu$.

Interestingly, although $\delta_0 \notin \mathcal H_k$, $k$ is still able to approximate $\delta_0$ arbitrarily well in infinity norm. First pick, for each $n$, an $f_n \in \mathrm{span}\{k_0, k_n, k_{-n}\}$ such that $(f_n|k_0) = (f_n|k_n) = 0$ and $(f_n|k_{-n}) = 1$.
Then $g_N = k_0 - \sum_{n=1}^{N}\epsilon_n f_n$ equals $k_0$ on $\{0, N+1, -(N+1), N+2, -(N+2), \ldots\}$ and 0 on $\{1, -1, \ldots, N, -N\}$. As $k_0$ vanishes at infinity, $g_N$ converges to $\delta_0$ in infinity norm as $N \to \infty$. However, despite approximating $\delta_0$ in infinity norm, $(g_N)_N$ does not approximate $\delta_0$ in $\mathcal H_k$: $\|g_N\|_k$ grows unboundedly.⁶

Appendix C. Discrete masses in the alignment kernel

In this section we prove the alignment kernel has discrete masses under certain parameter settings. We first establish some properties of sequences with encodings that are not one-hot. In particular, we show that convolution of kernels works in the same way when we reparameterize the alphabet. Note, for $V, W \in \cup_{L=0}^{\infty}(\mathbb R^B)^L$, we write $V_{(l)}, V_{(l:l')}$ to index positions along the first dimension and $V + W$ for concatenation along the first dimension, in analogy to the definitions for regular sequences in $\mathcal S = \cup_{L=0}^{\infty} B^L$.

Proposition 32 Say $k_1, k_2$ are kernels on $\mathcal S$ and $V \in (\mathbb R^B)^L$, $V' \in (\mathbb R^B)^{L'}$. Then,
\[
k_1 * k_2(V, V') = \sum_{l_1+l_2=L,\; l_1'+l_2'=L'} k_1\!\left(V_{(:l_1)}, V'_{(:l_1')}\right)k_2\!\left(V_{(l_1:)}, V'_{(l_1':)}\right).
\]

6. To see why $\|g_N\|_k$ must blow up, note that if this is not the case, then by the Banach-Alaoglu theorem a subsequence $(g_{N_k})_k$ must weakly converge to a $g \in \mathcal H_k$, that is, $g_{N_k}(z) = (g_{N_k}|k_z)_k \to (g|k_z)_k = g(z)$ for all $z$. Thus, $\delta_0 = g \in \mathcal H_k$, a contradiction.

Proof Expanding the multilinear extension of the convolution kernel and exchanging the order of summation,
\[
k_1 * k_2(V, V') = \sum_{X, Y\in\mathcal S}\left(\prod_{l=0}^{|X|-1} V_{l,X_{(l)}}\right)\left(\prod_{l=0}^{|Y|-1} V'_{l,Y_{(l)}}\right)\sum_{l_1, l_1'} k_1\!\left(X_{(:l_1)}, Y_{(:l_1')}\right)k_2\!\left(X_{(l_1:)}, Y_{(l_1':)}\right) = \sum_{l_1+l_2=L,\; l_1'+l_2'=L'} k_1\!\left(V_{(:l_1)}, V'_{(:l_1')}\right)k_2\!\left(V_{(l_1:)}, V'_{(l_1':)}\right),
\]
where in the second equality we split each product over positions into a product over the first $l_1$ (resp. $l_1'$) positions and the remainder, and summed each factor separately.

We now prove Theorem 21.

Theorem 33 (proof of Theorem 21) Define $K$ as the matrix with $K_{b,b'} = \bar k_s(b, b')$ for $b, b' \in B$, and $\sigma = 1_B^T K^{-1}1_B$. If $\bar\mu > 0$, then $k$ has discrete masses if and only if $2\mu \geq \log\sigma$. If $\bar\mu = 0$, then $k$ has discrete masses if and only if $2\mu > \log\sigma$.

Proof The proof is organized as follows. First, we reparameterize $B$ such that $\mathcal H_k$ decomposes into a set of orthogonal hyperplanes $W_V$. Second, we show that each of these hyperplanes is isomorphic to a tensor product of RKHSs of the alignment kernel with $|B| = 1$. Finally, the result follows from Theorem 39, which establishes conditions under which alignment kernels with $|B| = 1$ have discrete masses.

Part 1: We start by constructing the reparameterized alphabet. Call $\bar U = K^{-1}1_B$, where $1_B \in \mathbb R^B$ is the vector with 1 in each position, and let $U = \bar U/\bar k_s(\bar U, \bar U)$, so
\[
U^T 1_B = \bar U^T K\bar U/\bar U^T K\bar U = 1 \quad\text{and}\quad \bar k_s(U, U) = (\bar U^T K\bar U)^{-1} = \bar k_s(\bar U, \bar U)^{-1} = (1_B^T K^{-1}1_B)^{-1} = \sigma^{-1}.
\]
Let $B_1, \ldots, B_{|B|-1} \in \mathbb R^B$ be chosen so that $\{U, B_1, \ldots, B_{|B|-1}\}$ is an orthogonal basis of the vector space $\mathbb R^B$ with the dot product $(v|w) = v^T K w$, with $(B_i|B_i) = 1$ for all $i$. Thus, $\bar k_s(B_i, B_j) = \delta_i(j)$ for all $i, j$ and $\bar k_s(U, B_i) = 0$ for all $i$. Now call $\tilde B = \{U, B_1, \ldots, B_{|B|-1}\}$. We will use $\tilde B$ as a reparameterized alphabet and note that by Proposition 17 the alignment kernel has discrete masses if and only if it has discrete masses when $B$ is replaced with $\tilde B$.

Now, for any number $L$ and $V \in \tilde B^L$, the insertion kernel $k_I$ applied to $V$ is,
\[
k_{I,V} = \sum_{X\in B^L}\left(\prod_{l=0}^{L-1}V_{l,X_{(l)}}\right)e^{-\bar\mu - L\mu} = e^{-\bar\mu - L\mu}\prod_{l=0}^{L-1}\sum_{b\in B}V_{l,b}.
\]
Note for any $B' \in \tilde B$, we have $\sum_{b\in B} B'_b = B'^T 1_B = B'^T K\bar U = \bar k_s(B', U)/\bar k_s(U, U) = \delta_U(B')$. Thus, $k_{I,V} = 0$ if any letters in $V$ are in $\tilde B \setminus \{U\}$.

Now say $V \in \tilde B^L$, $V' \in \tilde B^{L'}$. By Proposition 32, $k(V, V')$ is given by
\[
\sum_{l\in\mathbb N}\ \sum_{\substack{V^{(1)}+\cdots+V^{(2l+1)}=V \\ V'^{(1)}+\cdots+V'^{(2l+1)}=V'}} k_I(V^{(1)}, V'^{(1)})\prod_{i=1}^{l} k_s(V^{(2i)}, V'^{(2i)})\,k_I(V^{(2i+1)}, V'^{(2i+1)}).
\]
Each term in the sum will be non-zero only if $V^{(m)}$ and $V'^{(m)}$ are composed only of the letter $U$ for all odd $m$, and $V^{(m)} = V'^{(m)}$ and $|V^{(m)}| = 1$ for all even $m$.
In particular, if $V \in (\tilde B\setminus\{U\})^L$, sequences $V'$ that satisfy $k(V, V') \neq 0$ must take the form
\[
(t_0\cdot U) + V_{(0)} + (t_1\cdot U) + V_{(1)} + \cdots + V_{(L-1)} + (t_L\cdot U)
\]
for some integers $t_0, \ldots, t_L$, where $t\cdot U$ denotes the letter $U$ repeated $t$ times. Call this sequence $V[t]$. Let $W_V$ denote the set of all such sequences, i.e. $W_V = \{V[t] \mid t_0, \ldots, t_L \in \mathbb N\}$. We see that $W_V$ is orthogonal to any $W_{V'}$ for which $V' \neq V$. Each $W_V$ thus defines an orthogonal hyperplane. Moreover, the union of all $\{W_V\}_V$ is the set of all sequences with letters in $\tilde B$. Proposition 14 therefore says that if the alignment kernel has discrete masses when restricted to each $W_V$, it has discrete masses.

Part 2: In this section we find that, for the alignment kernel to have discrete masses over $W_V$, it is sufficient that a simpler alignment kernel, with an alphabet size of just one, have discrete masses. Say $V[t], V[t'] \in W_V$ for some $V \in (\tilde B\setminus\{U\})^L$. Let $V[t_{1:}]$ denote $V[t]$ with the first $t_0 + 1$ letters dropped. Since the only alignments that have positive weight are those that align the $i$th run of $U$s in $V[t]$ to the $i$th run of $U$s in $V[t']$ for all $i$, we can decompose the alignment kernel into a product of alignment kernels comparing each of the runs of $U$s:
\[
k(V[t], V[t']) = k(t_0\cdot U, t'_0\cdot U)\,\bar k_s(V_{(0)}, V_{(0)})\,k(V[t_{1:}], V[t'_{1:}]) = \cdots = \prod_{l=0}^{L} k(t_l\cdot U, t'_l\cdot U),
\]
using $\bar k_s(V_{(l)}, V_{(l)}) = 1$ for each $V_{(l)} \in \tilde B\setminus\{U\}$. Now call $k_1$ the alignment kernel when $B = \{\mathrm A\}$, with parameter $\bar k_s(\mathrm A, \mathrm A) = \sigma^{-1}$. Since $\bar k_s(U, U) = \sigma^{-1}$, we have $k_1(l\cdot\mathrm A, l'\cdot\mathrm A) = k(l\cdot U, l'\cdot U)$. Thus $k$ on $W_V$ acts as the $L+1$-times tensor product of $k_1$. Thus, by Proposition 15, if $k_1$ has discrete masses, so does $k$ on $W_V$ for every $V$, so that $k$ has discrete masses on $\mathcal S$; on the other hand, if $k_1$ does not have discrete masses, neither does $k$ on $W_\emptyset$, so that $k$ does not have discrete masses on $\mathcal S$.

Finally, in Theorem 39 we will see that if $|B| = 1$ then, for $\bar\mu > 0$, $k_1$ has discrete masses if and only if $2\mu + \log \bar k_s(\mathrm A, \mathrm A) \geq 0$. Thus if $\bar\mu > 0$, we find that $k_1$, and therefore $k$, has discrete masses if and only if $2\mu \geq \log\sigma$. The result similarly follows for $\bar\mu = 0$.

Note that this proof also extends to alignment kernels that have different penalties for inserting different letters. More precisely, we can modify $k_I$ such that it differs for different letters by using $\tilde g(X) = \left(\prod_{l=0}^{|X|-1}t(X_{(l)})\right)g(X)$ for a function $t \in \mathbb R^B$. Above we chose $t = 1_B$. For general $t$ define $U = K^{-1}t/(t^T K^{-1}t)$ and define $\tilde B = \{U, B_1, \ldots, B_{|B|-1}\}$ the same way. Again, for any $B' \in \tilde B$, $\sum_b t(b)B'_b = \bar k_s(B', U)/\bar k_s(U, U) = \delta_U(B')$, so $k_{I,V} = 0$ for any sequence $V$ that has any letters orthogonal to $U$. Calling $\sigma = t^T K^{-1}t$, we can repeat the proof of Theorem 21 to reach the same conclusion.

Appendix D. Interpreting the conditions for discrete masses in alignment kernels

In Theorem 21 we saw that the alignment kernel has discrete masses only under certain conditions. An interesting aspect of these conditions is that they involve an interplay between the kernel's insertion parameters (through $\mu$) and the substitution parameters (through $\sigma$). Often these parameters are thought of as entirely distinct; after all, biologically, substitution and insertion mutations can stem from quite distinct molecular processes. Here we build some intuition for their interplay in determining discrete masses. We focus on the special case where $\bar\mu = 0$.
The key idea is to rewrite the alignment kernel as a sum over a particular position-wise comparison kernel applied to each pairwise alignment; then, the conditions for the alignment kernel to have discrete masses turn out to be the same as the conditions for the corresponding position-wise kernel to have discrete masses. In detail, recall that Equation 6 decomposes the alignment kernel into a sum over pairwise alignments. For example, consider the alignment between $X = \mathrm{ATGC}$ and $Y = \mathrm{ACC}$ we saw in Section 7.1. We can rewrite the corresponding term in Equation 6 in the form
\[
\hat k_s(\mathrm A, \mathrm A)\,\hat k_s(\mathrm T, \mathrm C)\,\hat k_s(\mathrm G, \text{-})\,\hat k_s(\mathrm C, \mathrm C)\,\hat k_s(\text{-}, \text{-})\,\hat k_s(\text{-}, \text{-})\cdots,
\]
where we define the extended letter comparison kernel $\hat k_s$ as $\hat k_s(b, b') = \bar k_s(b, b')$ for $b, b' \in B$, $\hat k_s(b, \text{-}) = e^{-\mu}$ for $b \in B$, and $\hat k_s(\text{-}, \text{-}) = 1$. This matches the form of a position-wise comparison kernel (Def. 18) between the sequences ATGC--... and AC-C--..., with letter kernel $\hat k_s$, and where the gap symbol is also the stop symbol. In general, every term in the sum making up the alignment kernel (Equation 6) can be written in this way, with the same letter kernel.

Applying Theorem 19, the position-wise kernel has discrete masses if (and only if) $\hat k_s$ is strictly positive definite.⁷ This, in turn, depends on the relationship between the substitution and insertion parameters, since both go into $\hat k_s$. Let $V \in \mathbb R^{B\cup\{\text{-}\}}$ such that $V_{\text{-}} = 1$. Call $\tilde V = (V_b)_{b\in B}$ the components of $V$ in $B$. Define $\bar U = K^{-1}1_B$, so $\bar U^T K\bar U = \sigma$. Now,
\[
\hat k_s(V, V) = \tilde V^T K\tilde V + 2e^{-\mu}\tilde V^T 1_B + 1 = \tilde V^T K\tilde V + 2e^{-\mu}\tilde V^T K\bar U + 1.
\]
Now decompose $\tilde V$ as $\tilde V = \alpha\bar U + V^\perp$, where $V^\perp$ is orthogonal to $\bar U$ under the inner product $(v|w) = v^T K w$. Then,
\[
\hat k_s(V, V) = V^{\perp T} K V^\perp + \sigma\left(\alpha^2 + 2e^{-\mu}\alpha\right) + 1.
\]
This expression is minimized at $\alpha = -e^{-\mu}$, $V^\perp = 0$, in which case $\hat k_s(V, V) = 1 - \sigma e^{-2\mu}$. Thus $\hat k_s$ is strictly positive definite if and only if $e^{-2\mu}\sigma < 1$, i.e. $2\mu > \log\sigma$. Therefore, the alignment kernel has discrete masses if and only if the position-wise comparison kernel has discrete masses.

At bottom, then, the flexibility of the alignment kernel depends on the flexibility of the kernel we use for comparing aligned sequences, and this in turn depends on the flexibility of the kernel we use for comparing letters, both real letters (nucleotides/amino acids) and gaps. The interplay between substitution and insertion parameters stems from the fact that we have in essence added the gap symbol to the alphabet, and need the letter kernel over this extended alphabet to still be strictly positive definite.

7. The "only if" direction is straightforward: if $\hat k_s$ is not strictly positive definite, Theorem 11 is violated.
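The condition $2\mu > \log\sigma$ can be checked numerically. The following sketch (an illustration written for this exposition; the exponential Hamming letter kernel with bandwidth 1 on a 4-letter alphabet is an arbitrary example choice) builds the extended letter kernel matrix and confirms that its smallest eigenvalue changes sign exactly where the derivation above predicts.

```python
import numpy as np

# Extend the letter kernel matrix K with a gap symbol (cross entries e^{-mu},
# gap-gap entry 1) and verify: the extended matrix is strictly positive
# definite exactly when 2 mu > log sigma, with sigma = 1^T K^{-1} 1.

lam, B = 1.0, 4
K = np.full((B, B), np.exp(-lam)) + (1 - np.exp(-lam)) * np.eye(B)
sigma = np.ones(B) @ np.linalg.solve(K, np.ones(B))

for mu in [0.25 * np.log(sigma),            # violates the condition
           0.50 * np.log(sigma) + 0.1]:     # satisfies the condition
    K_ext = np.ones((B + 1, B + 1))
    K_ext[:B, :B] = K
    K_ext[B, :B] = K_ext[:B, B] = np.exp(-mu)
    min_eig = np.linalg.eigvalsh(K_ext).min()
    print(f"2 mu - log sigma = {2 * mu - np.log(sigma):+.3f}, "
          f"min eigenvalue = {min_eig:+.4f}")  # negative, then positive
```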
In particular, k(X, Y ) is equal to (jl)M l= 1 G(|X|,M) (j l)M l= 1 G(|Y |,M) ki(X(:j0), Y(:j 0)) m=0 ks(X(jm), Y(j m))ki(X(jm+1:jm+1), Y(j m+1:j m+1)) Kernels for Biological Sequences By our assumption on ks, for each alignment J G(|X|, M, g), J G(|Y |, M, g ), the contribution from the aligned letters, QM 1 m=0 ks(X(jm), Y(j m)), is 0 if X(J) = Y(J ) and is σ M|B|M otherwise. On the other hand, the contribution from the insertions and deletions is QM 1 m=0 k I(X(jm:jm+1), Y(j m+1:j m+1)) = e µ(|X| M+|Y | M)e µ(g+g ). Thus, k A(X, Y ) is equal to e2µσ 1|B| M X g,g e µ(g+g ) X J G(|X|,M,g) J G(|Y |,M,g ) 1(X(J) = Y(J )) g,g e µ(g+g ) X J G(|X|,M,g) J G(|Y |,M,g ) 1(X(J) = V )1(Y(J ) = V ) V BM u V (X)u V (Y ) Thus the features of k A are X 7 (u V (X))V S. Now we will show that u V Hk for all V S and that (u V )V make up an orthonormal basis. Define the infinite matrix Q as QV,Z = e 1 2 ζ|V |u V (Z) for all V, Z S. Pick some total ordering, , of S such that V Z if |V | |Z|. In this case, Q is an infinite upper triangular matrix with 1 s along the diagonal. Thus Q has an infinite upper triangular matrix Q 1 as its inverse with 1s along its diagonal. In the remainder of the proof, we will represent any P Z αZk A Z span{k A Z}Z S as an infinite vector, indexed by elements of S, with αZ in position Z. So, for any function f span{k A Z}Z S, we can think of the matrix-vector product Qf as the featurization of f. Now, for any f, g span{k A Z}Z S, we have from Equation 12, (f|g)k A = f T QT diag(eζ|V |)V SQg, where diag(eζ|V |)V S denotes the infinite diagonal matrix with entries eζ|V |. We will next show that u X Hk A, with the representation f X = e 1 2 ζ|X| P Z Q 1 Z,Xk Z span{k A Z}Z S (note the sum has only finitely many non-zero terms as Q 1 Z,X = 0 for only finitely many Z with Z X). We have, diag(eζ|V |)V SQf X = e 1 2 ζ|X|d X, where d X is the infinite-dimensional vector indexed by S with 0 s in every position except 1 in position X. For any Y S, (f X|k A Y )k A = e 1 2 ζ|X|QX,Y = u X(Y ). Thus, u X Hk A. We also have, (u X|u Y )k A = f T XQT diag(eζ|V |)V SQf Y = d T Xd Y = d X(Y ). Thus {u X}X S is a set of orthogonal non-zero vectors, normalized such that u X k A = 1 for all X S. Finally, by Equation 12, k A(X, Y ) = V S (u V |k A X)k A(u V |k A Y )k A for all X, Y S so that {u X}X S is also a basis. Amin, Marks, and Weinstein We now consider the general case, allowing ks to be any strictly positive definite kernel on letters, instead of requiring that it is diagonal. The first step is to show that any alignment kernel can be written as a reparameterization of an alignment kernel with diagonal ks. Proposition 35 Consider an alignment kernel k, with parameters µ, µ and ks. Let Kb,b = ks(b, b ) for b, b B and σ = 1T BK 11B. There is an alphabet B, inducing the set of sequences S, such that k applied to S is an alignment kernel with parameters µ, µ and ks( b, b ) = σ 1|B|δ b( b ) for all b, b B. As well, for any b B, P b B ks( b, b) = σ 1|B|. We call B a rectified alphabet for k and S a rectified set of sequences for k. Proof Let B1 = K 11B/ σ so ks(B1, B1) = 1 and let {B2, . . . , B|B|} be an orthonormal set orthogonal to B1 in RB with inner product ks. Let Φ R|B| |B| be a unitary matrix with |B| 1/2 in every entry of its first column and define the letters Ci = p σ 1|B| P|B| j=1 Φi,j Bj. Thus we have that ks(Ci, Cj) = σ 1|B|(Φi, |Φj, ) = σ 1|B|δi(j) and P b B Ci,b = CT i 1B = σCT i KB1 = p |B| P|B| j=1 Φi,jks(Bi, B1) = p |B|Φi,1 = 1. 
Call B = {Ci}i and S the set of sequences with letters in B. By Proposition 32, for V, V S, k(V, V ) is given by l N,V (1)+ +V (2l+1)=V,V (1)+ +V (2l+1)=V k I(V (1), V (1)) i=1 ks(V (2i), V (2i))k I(V (2i+1), V (2i+1)). For b, b B, ks( b, b ) = σ 1|B|δ b( b ). As well, for any number L and V BL, the insertion kernel k I applied to V is, l=0 Vl,X(l) e µ Lµg = e µ Lµ L 1 Y g = e µ Lµg. Thus, for V, V S with |V |, |V | > 0, k I(V, V ) = e µ |V |µe µ |V |µ. Thus k on S is the alignment kernel with the same parameters µ, µ and with ks( b, b ) = σ 1|B|δ b( b ) for b, b B. Finally, note, for any b B, X i ks(Ci, b) = p i,j Φi,jks(Bj, b) 1T |B|Φ ,j ks(Bj, b) |B|δ1(j)ks(Bj, b) σ 1|B|ks(B1, b) Kernels for Biological Sequences We can now describe the features of an alignment kernel with arbitrary ks. When ks was diagonal, each feature of a sequence X depended on the number of positions at which a gapped kmer matched X exactly. Now, instead of demanding an exact match, we score each match according to ks. Corollary 36 Consider an alignment kernel k and let S be a rectified set of sequences. For Y S, V S with L = |Y | = |V |, define the pairwise comparison using ks as, cpw(Y, V ) = l=0 σ|B| 1ks(Y(l), V(l)). Tilting the kernel by A(X) = eµ|X| gives k A(X, Y ) = A(X)A(Y )k(X, Y ). For all sequences X S, define the features {u V (X)}V S, given by u V (X) = e 1 2 ζ|V | X J G(|X|,g) cpw(X(J), V ), (13) where ζ = 2µ log σ + log |B|. Now, {u V (X)}V S is an orthonormal basis for Hk A. Proof Let B, S a rectified alphabet and a rectified set of sequences for k. By Theorem 24, there is an orthonormal basis of Hk A, {u V }V S such that for any V, X S (u V |k A X)k A = e 1 2 ζ|V | X J G(|X|,g) 1(X(J) = V ). Note that reparameterization does not affect the RKHS Hk A, so that {u V }V S is an orthonormal basis for the kernel k A applied to the space S as well. Now, consider a Y S, with L = |Y |. Since B is a basis for RB, we can write each letter Y(l) as Y(l) = P b B Y(l), b b for some scalar coefficients {Y(l), b} b B (to be clear, we are treating Y(l) here as a one-hot encoding of the lth letter of Y ). To derive the coefficients Y(l), b, recall σ|B| 1 b} b B is an orthonormal basis of RB, with inner product defined by ks. Thus, we have Y(l), b = ks(Y(l), p σ|B| 1 b) p σ|B| 1 = σ|B| 1ks(Y(l), b) for all b B. We also have, applying the properties of the rectified alphabet in Proposition 35, P b B Y(l), b = Amin, Marks, and Weinstein σ|B| 1 P b B ks(Y(l), b) = 1. Thus, (u V |k A Y )A k = X X S : |X|=L l=0 Y(l),X(l) e 1 2 ζ|V | X J G(|X|,g) 1(X(J) = V ) =e 1 2 ζ|V | X X S : |X|=L l=0 Y(l),X(l) m=0 1(X(jm) = V(m)) =e 1 2 ζ|V | X m=0 Y(jm),V(m) =e 1 2 ζ|V | X J G(|X|,g) 1 m=0 σ|B| 1ks(Y(jm), V(m)) =e 1 2 ζ|V | X J G(|X|,g) cpw(Y(J), V ) Appendix F. Alignment kernels are universal and characteristic for certain tiltings In this section we show that all alignment kernels are, under an appropriate tilting, universal and characteristic even if they do not have discrete masses. Note that while a kernel with discrete masses maintains discrete masses after any tilting, the same is not true for universal or characteristic kernels; here, the tilting matters. Functions in C0 vanish at infinity, so to be sure the alignment kernel can approximate any C0 function, even when the kernel does not have discrete masses, the tilting will have to be sufficiently small. The normalizing tilting A(X) = p k(X, X) 1 is not necessarily small enough. Instead, one would in general need to tilt so much that k A(X, X) 0 as |X| . 
From a modeling perspective, this is problematic: if f is drawn from a Gaussian process with kernel k A, then its variance will go to zero as the length of sequences increases: Var(f(X)) = k A(X, X) 0 as |X| . Thus, on longer sequences, the value of f will be determined essentially just by the prior. So, although we can force alignment kernels that lack discrete masses to be universal if we tilt them enough, the tilted kernels will generalize poorly. Proposition 37 Say k is an alignment kernel and let k be an alignment kernel with the same values of µ, µ as k but with ks(b, b ) = σe 2µ+ϵks(b, b ) for some ϵ > 0. Choose an A : S (0, ) such that k A is a C0-kernel, meaning Y 7 A(Y )A(X) k(X, Y ) is in C0(S) for every X and sup X S A(X)A(X) k(X, X) < . Then the tilted alignment kernel k A is C0-universal and characteristic. Proof Assume throughout that 2µ log σ, otherwise the result follows from Theorem 21 and Corollary 16. Note k(X, Y ) k(X, Y ) for all X, Y S. Thus, k A is a C0-kernel. Also Kernels for Biological Sequences note that, defining Kb,b = ks(b, b ), we have log(1B K 11B) = 2µ ϵ < 2µ, so k has discrete masses by Theorem 21. We now derive an orthogonal basis for H k A. Let A(X) = exp( µ|X|) and let S be a rectified set of sequences for k A. Now, by Corollary 36 , {A Au V }V S is an orthonormal basis of Hk A. Note S is also a rectified set of sequences for k A, so let {A A u V }V S be the orthonormal basis of H k A from Corollary 36. Note cpw from Corollary 36 is identical for k A and k A so that each u V is a scaled version of u V ; in particular, {A Au V }V S is an orthogonal basis for H k A. Since k has discrete masses, it is universal, so for any f C0(S) and ϵ > 0 we can pick a g H k such that f g < ϵ. Define M = sup X S q A(X)A(X) k(X, X) < and choose a g span{ k A X}X such that g g k A ϵ/M. Now, g C0(S) and |g (X) g(X)| = |( k A X|g g) k A| k A X k A g g k A Mϵ/M = ϵ for all X S. Thus, g f 2ϵ. Thus, every function in H k A can be approximated in C0 by a function in span{ k A X}X. Thus, span{ k A X}X is dense in C0(S). Notice the finite dimensional vector spaces span{ k A X}X S : |X| L and span{A Au X}X S : |X| L are identical for all L (recall from the proof of Proposition 17 that reparameterizing the alphabet preserves the span of the kernel embedding vectors). As a result, span{ k A X}X S = span{A Au X}X S. But this latter vector space is similarly equal to span{k A X}X. Thus, span{k A X}X = span{ k A X}X, so span{k A X}X is dense in C0(S). k A is thus universal. Further, by proposition 2 of Sriperumbudur et al. (2011), k A is also characteristic. To interpret this result, and the tiltings it requires, note that we must choose a tilting A such that k A(X, X) is bounded. Roughly speaking, we expect k(X,X) k(X,X) to grow exponentially with the length |X|, since each match in each pairwise alignment has been up-weighted by a factor of σe 2µ+ϵ > 1. The normalizing tilting A(X) = p k(X, X) 1 would help us bound k A(X, X), but to bound k A(X, X) we need to go further and exponentially down-weight longer sequences. This means choosing a tilting A such that k A(X, X) 0 as |X| 0. Appendix G. Discrete masses in the one-letter alignment kernel In this section we show that alignment kernels have discrete masses under certain parameter settings when the alphabet size is one. This result is crucial for proving the more general result, covering arbitrary alphabet sizes, in Theorem 21 and Appendix C. It builds on the feature representation described in Theorem 24 and Appendix E. 
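Before developing the formal argument, the brute-force sketch below gives a numerical illustration under stated assumptions. Following the Gram matrix factorization used in the proof of Theorem 39, it builds the Gram matrix of the tilted one-letter alignment kernel on lengths $0, \ldots, n$ as $Q^\top \mathrm{diag}(e^{\zeta M}) Q$, where $Q$ is computed by enumerating gapped subsequences (end gaps counted as gap runs) and $\zeta = 2\mu + \log k_s(A, A)$, and then tracks a diagonal entry of the inverse Gram matrix as $n$ grows; a bounded entry indicates the corresponding delta function lies in the RKHS. The helper names, parameter values, and the exact roles assigned to $\mu$ and $\bar\mu$ are illustrative assumptions.

```python
import itertools
import numpy as np

def gap_runs(kept, L):
    # Number of maximal runs of omitted positions (including runs at either end)
    # when only the indices in `kept` are retained from 0..L-1.
    omitted = [i for i in range(L) if i not in set(kept)]
    runs, prev = 0, None
    for i in omitted:
        if prev is None or i != prev + 1:
            runs += 1
        prev = i
    return runs

def Q_matrix(n, mu_bar):
    # Q[M, L] = sum over length-M gapped subsequences of a length-L sequence of
    # exp(-mu_bar * #gap-runs); brute force over subsets, so keep n small.
    Q = np.zeros((n + 1, n + 1))
    for L in range(n + 1):
        for M in range(L + 1):
            for kept in itertools.combinations(range(L), M):
                Q[M, L] += np.exp(-mu_bar * gap_runs(kept, L))
    return Q

def inverse_gram_diag(n, mu, mu_bar, ks_AA, entry=0):
    # Diagonal entry of the inverse Gram matrix of the tilted one-letter
    # alignment kernel on lengths 0..n, with Gram = Q^T diag(e^{zeta*M}) Q.
    zeta = 2 * mu + np.log(ks_AA)
    Q = Q_matrix(n, mu_bar)
    gram = Q.T @ np.diag(np.exp(zeta * np.arange(n + 1))) @ Q
    return np.linalg.inv(gram)[entry, entry]

# Illustrative parameter values (assumptions, not from the paper): with mu_bar > 0,
# zeta = 2*mu + log ks(A,A) >= 0 should give a bounded entry (delta_0 in the RKHS),
# while zeta < 0 should make the entry grow without bound as n increases.
for mu in (0.1, 1.0):
    print(mu, [round(inverse_gram_diag(n, mu, mu_bar=0.3, ks_AA=0.5), 3)
               for n in range(2, 10)])
```

The brute-force enumeration keeps the sketch close to the counting argument in the proof, at the cost of exponential work in $n$; it is only meant for very short lengths.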
The result will depend on the theory of the Riordan group, for which we will need to introduce some notation. If f(x) = P i=0 Cixi is a formal power series, define [xn]f = Cn. For two formal power series f(y), g(y) we define the Riordan array of f and g as the infinite upper triangular matrix with entries QL ,L = [x L y L] g(y) 1 xyf(y) = [x L y L] l=0 g(y)xlylf(y)l = [y L]y L g(y)f(y)L Amin, Marks, and Weinstein for numbers L, L 0. We say that Q is in the Bell subgroup if f = g. The result we will need is that one can invert a Riordan array by inverting the power series f and g (Shapiro et al., 1991). Proposition 38 (Shapiro et al., 1991) Let QL ,L = [x L y L] g(y) 1 xyf(y) and let f be the unique power series such that the equation y f(y)f(y f(y)) = y holds.8 Then, letting g(y) = g(y f(y)) 1 , Q 1 L ,L = [x L y L] g(y) 1 xy f(y). Note if Q is in the Bell subgroup, g(y) = f(y) and f(y)f(y f(y)) = 1 so g(y) = f(y). Thus Q 1 is also in the Bell subgroup. l=0 QL,l Q 1 l,L = l=0 [yl]g(y) (yf(y))L [z L ] g(z) z f(z) l =[z L ] g(z) z f(z) l [yl]g(y) (yf(y))L =[z L ] g(z)g(z f(z)) z f(z)f(z f(z)) L =[z L ]1 z L = δL (L). Theorem 39 Say |B| = 1 and call A the only letter in B. If µ > 0, k has discrete masses if and only if 2µ + log ks(A, A) 0. If µ = 0, k has discrete masses if and only if 2µ + log ks(A, A) > 0. Proof Note in this case ζ = 2µ + log ks(A, A). We will replace k with k A(X, Y ) = eµ|X|eµ|Y |k(X, Y ) throughout. Clearly, this does not affect whether or not the kernel has discrete masses. Index S = {0, 1, . . . }. Call Q the infinite upper triangular matrix with e 1 2 ζL u L (L) = P g=0 e g µ|G(L, L , g)| in position (L , L). Note that the Gram matrix of k on {0, 1, . . . L} is QT :L+1,:L+1diag(eζL )L L =0Q:L+1,:L+1. Since Q is infinite upper triangular with no zeros along its diagonal it has an infinite upper triangular inverse Q 1 with the property that (Q:L,:L) 1 = (Q 1):L,:L. Thus the inverse of the Gram matrix of k on {0, 1, . . . L} is Q 1T :L+1,:L+1diag(e ζL )L L =0Q 1 :L+1,:L+1 which has, in position (L , L ), PL l=L e ζl(Q 1 L ,l)2. Thus, by Proposition 11, δL Hk if and only if l=L e ζl(Q 1 L ,l)2 < . (14) 8. Another way to write this condition is that tf(t) = y when t = y f(y) so that yf is the compositional inverse of y f. Kernels for Biological Sequences The proof will now proceed in two parts using the technique of generating functions and the theory of the Riordan group. In part 1 we will equate u L(L ) with the coefficients of a formal power series. We will then explicitly invert the matrix Q by inverting this power series. In part 2 we will equate Equation 14 with the convergence of a power series, which we will use to prove the result. Part 1: Say g is an integer greater than or equal to 1, and L L . We want to find the size of G(L, L , g). Each J = (j 1, . . . , j L ) G(L, L , g) is uniquely characterized by the numbers jm1, . . . , jmg, which indicate the positions at which each of the g gaps start, and jm1+1, . . . , jmg+1, which indicate the position at which each of the g gaps end. In particular, J = ( 1, 0, 1, . . . , jm1 1, jm1, jm1+1, jm1+1 + 1, . . . , jm2 1, jm2, jm2+1, . . . , L). Thus each J is also uniquely characterized by the sizes of its g gaps l1 = jm1+1 jm1 1, l2 = jm2+1 jm2 1, . . . as well as the sizes of its g + 1 contiguous regions c0 = jm1 + 1, c1 = jm2 jm1+1 + 1, c2 = jm3 jm2+1 + 1, . . . , cg = L jmg+1. On the other hand, given numbers l1, . . . , lg, c0, . . . 
, cg, one gets a unique valid J G(L, L , g) if and only if Pg i=0 ci = L , Pg i=1 li = L L , li 1 for all i, and ci 1 for all i = 1, 2, . . . , g 1 (c0 = 0 implies a gap at the beginning of the sequence while cg = 0 implies a gap at the end). The number of g + 1 ordered numbers (c0, . . . , cg) with Pg i=0 ci = L and ci 1 for all i = 1, 2, . . . , g 1 is the L -th coefficient of the power series (1 + x + x2 + . . . ) x + x2 + . . . g 1 (1 + x + x2 + . . . ) = xg 1 Similarly, the number of sets of number of g ordered numbers (l1, . . . , lg) with Pg i=1 li = L L and li 1 for all i is the L L -th coefficient of the power series yg (1 y)g . On the other hand if g = 0, G(L, L , g) is empty if L = L otherwise it has one element. In particular, G(L, L , 0) is (trivially) the coefficient in front of x L y L L of 1/(1 x). Thus, the size of G(L, L , g) is the coefficient in front of x L y L L zg of the power series g=1 zg xg 1 (1 x)g+1 yg (1 y)g + 1 1 x = zy (1 x)2(1 y) z x 1 x y 1 y = zy (1 x)2(1 y) (1 x)(1 y) (1 x)(1 y) zxy + 1 1 x zy (1 x)(1 y) zxy + 1 zy + 1 x y + xy zxy (1 x)(1 y) zxy (zy + 1 y)(1 x) (1 x)(1 y) zxy = 1 (1 z)y (1 x)(1 y) zxy 1 y 1 x1 (1 z)y Amin, Marks, and Weinstein Call fz(y) = 1 (1 z)y 1 y . Next we replace x with xy so that we get that |G(L, L , g)| is [(xy)L y L L zg] fz(y) 1 (xy)fz(y) = [x L y Lzg] fz(y) 1 xyfz(y). Thus, substituting z = e µ and calling f = fe µ, g=0 e g µ|G(L, L , g)| = [x L y L] g=0 e g µ[zg] fz(y) 1 xyfz(y) = [x L y L] f(y) 1 xyf(y). Q is in the Bell subgroup of Riordan arrays (Shapiro et al. (1991) and chapter 3 of Shapiro et al. (2022)). This means that by Proposition 38, defining f as the unique power series with y f(y)f(y f(y)) = y, Q 1 L ,L = [x L y L] f(y) 1 xy f(y). Part 2: Since Q 1 L ,L = [x L y L] f(y) 1 xy f(y) = [y L] f(y)(y f(y))L = [y L L ] f(y)L +1, l=L e ζl(Q 1 L,l)2 = e ζL X 2 ζl[yl] f(y)L+1)2. This sum is finite if e 1 2 ζ is smaller than the radius of convergence of f(y)L+1 (as the terms in the series decay exponentially) and infinite if it is larger (as the terms do not approach 0). We will now show that the radius of convergence of f is exactly 1, so that if ζ < 0 then δ0 Hk and if ζ > 0 then, since the radius of convergence of f(y)L is also at least 1 for all L, k has discrete masses. We will afterwards carefully consider the case of ζ = 0. Let us first consider the case µ = 0. In this case we have, from the definition of f, yf(y) = y 1 y, so y f(y) = y 1+y. In this case f clearly has radius of convergence 1. We can also see explicitly that [yl] f(y)L+1 = ( 1)l L+l l . Thus Equation 14 also does not hold for any L when ζ = 0. We now assume µ > 0. We will derive the form of f. Call ξ = 1 e µ (0, 1), and t = y f(y). tf(t) = y t(1 ξt) = (1 t)y 0 = ξt2 (y + 1)t + y. Thus, picking the root that is 0 when y = 0, we get (1 + y)2 4ξy2 . The roots of the polynomial (1 + y)2 4ξy2 are (2ξ 1)2 1 = (2ξ 1) i p Note both of these roots have norm 1 and can thus be written eiθ, e iθ for some π > θ > 0. Thus, we may write (1 eiθy)(1 e iθy) . Kernels for Biological Sequences 1 e iθy each have radius of convergence 1, so, y f has radius of convergence at least 1. On the other hand, the derivative of y f diverges when y approaches eiθ or e iθ, so that y f, and thus f, has radius of convergence exactly 1. The case when ζ = 0 is more delicate. First call g(y) = p (1 + y)2 4ξy2 = p (1 eiθy)(1 e iθy). 
For real 0 < r < 1, define q L(r) = Z 1 0 dtg( re2πit)Lg( re 2πit)L n rn/2e2πnit[yn]g(y)L X m rm/2e 2πmit[zm]g(z)L [yn]g(y)L[zm]g(z)L r(n+m)/2 Z 1 0 dte2πit(n m) n ([yn]g(y)L)2rn. g is analytic with real coefficients at 0 so ([yn]g(y)L)2 0 for all L, n. Thus, by the monotone convergence theorem, P n([yn]g(y)L)2 = limr 1 q L(r). However, g( re2πiz)Lg( re 2πiz)L = 1 ei(θ+2πz) r L 1 ei(θ 2πz) r L 4L, so P n([yn]g(y)L)2 < for all L. Now note n ([yn](y f(y))L)2 = X l=0 [yn](1 + y)l( 1)L lg(y)L l !2 l=0 ( 1)L l l X [yn l ]g(y)L l !2 [yn l ]g(y)L l 1 (2ξ)2L (L2)2 X 2 [yn l ]g(y)L l 2 1 (2ξ)2L (L2)2 L X [yn l ]g(y)L l 2 < . Thus k has discrete masses when ζ = 0 as well. Appendix H. Discrete masses in the local alignment kernel In this section we prove that the local alignment kernel has discrete masses under certain parameter settings. Amin, Marks, and Weinstein Theorem 40 (proof of Theorem 23) If µ = 0, kla has discrete masses if and only if 2µ > log σ. If > µ > 0, kla has discrete masses if and only if 2µ log σ. If µ = , kla has discrete masses. Proof We will only provide a sketch of the proof highlighting the differences to the case of the regular sequence kernel. The details are very similar to those of the proofs of Theorems 21 and 39. First we repeat the logic of Theorem 21 to reduce to the case of |B| = 1. Let U, WV , V [t], B as in Theorem 21. First note if V = , WV is simply kla with ks(U, U) = σ 1 on a space with |B| = 1. Thus, if this local alignment kernel does not have discrete masses, neither does kla on the entire space. Next consider |V | 1 and define the left-local alignment kernel kl(X, Y ) = l=0 k I (ks k I)l and the right-local alignment kernel kr similarly. Now we repeat the logic of Eqn 10: k(V [t], V [t ]) = X l N Y (1)+ +Y (2l+1)=t0 U Z(1)+ +Z(2l+1)=t 0 U l N Y (1)+ + Y (2l +1)=V [t1:] Z(1)+ + Z(2l +1)=V [t 1:] k I(Y (1), Z(1)) i=1 ks(Y (2i), Z(2i))k I(Y (2i + 1), Z(2i+1)) ks(V(0), V(0)) i=1 k I( Y (2i 1), Z(2i 1)ks( Y (2i), Z(2i))) k I( Y (1), Z(1)) =kl(t0 U, t 0 U)kr(V [t1:], V [t 1:]) =kl(t0 U, t 0 U) l=1 kali(tl U, t l U) kr(t L U, t L U). Thus, kla has discrete masses on WV if kali, kl, and kr with ks(U, U) = σ 1 and |B| = 1 have discrete masses. Thus, the theorem holds if we prove the case where |B| = 1 for kali, kla, kl, and kr. We therefore study the case where |B| = 1. We ve already shown the theorem for the regular alignment kernel. We now consider kl. Let G(L, L , g) be the number of length L sub-strings of a sequence of length L with g gaps not including a gap on the left end. In this case, each element of G(L, L , g) is characterized by the length of its g +1 contiguous regions, each of which must be at least length 1 except for the region after the last gap; and g + 1 gap regions, each of which must be at least length 1 except the first gap. Thus by the same argument as in Theorem 39, we get that |G(L, L , g)| is the coefficient in front Kernels for Biological Sequences of x L y L L zg of the power series (1 x)g+1 yg (1 y)g+1 = 1 (1 x)(1 y) (1 x)(1 y) (1 x)(1 y) xyz 1 1 y 1 xfz(y). Thus in this case, defining the matrix Q as in Theorem 39, QL ,L = [x L y L] 1 1 y 1 xyf(y). We can invert this matrix using Proposition 38. In particular, the numerator g(y) of the power series corresponding to the matrix inverse is 1 1 y (y f(y)) 1 = 1 y f(y). Q 1 L ,L = [x L y L] 1 y f(y) In this case, kl has discrete masses if and only if for all L, l=0 e ζl(Q 1 L,l)2 = 2 ζl[yl](1 y f(y))(y f(y))L 2 < . The proof of the theorem is almost identical in the case that µ < . 
If µ = then f(y) = 1 so f(y) = 1 and Q 1 L ,L = δL (L) δL +1(L). Thus, only finitely many terms of the sum are non-zero, and Equation 14 is satisfied. The case is identical for kr. Finally we look at kla. Let G(L, L , g) be the number of length L sub-strings of a sequence of length L with g gaps not including gaps on either end. In this case, when g > 0, each element of G(L, L , g) is characterized by the length of its g + 1 contiguous regions, each of which must be at least length 1; and g + 2 gap regions, each of which must be at least length 1 except the first and last gap. When g = 0 we must be sure to include the case where L = 0, even though L = 0 is impossible when g > 0. Thus, |G(L, L , g)| is the coefficient in front of x L y L L zg of the power series g=0 zg xg+1 (1 x)g+1 yg (1 y)g+2 + 1 1 y = x (1 x)(1 y)2 (1 x)(1 y) (1 x)(1 y) xyz + 1 1 y 1 xfz(y) + 1 1 y. So in this case, QL ,L = [x L y L]xy 1 xyf(y) + 1 1 y. Amin, Marks, and Weinstein The first row of Q is just 1 s and (Q1:,1:)L ,L = [x L y L] Q1:,1: is a Riordan array with inverse (Q1:,1:) 1 L ,L = [x L y L] 1 xy f(y) . Let R be the upper triangular metrix with R1:,1: = (Q1:,1:) 1, R0,0 = 1 and R0,1: = 1T (Q1:,1:) 1 where 1 is the infinite vector of ones. We will see that R = Q 1. First, for L, L 1, QL, R ,L = δL(L ). Next for L 0, QL, R ,0 = δL(0). Finally, for L 1, Q0, R ,L = (1T (Q1:,1:) 1)L 1 + 1T (Q1:,1:) 1 ,L 1 = 0. Thus R = Q 1. Since 1L = 1 = [x L](1 x) 1, 1T (Q1:,1:) 1 L (Q1:,1:) 1 L ,L[x L ](1 x) 1 L [y L] (1 y f(y))2 y f(y) L [x L ](1 x) 1 =[y L](1 y f(y))2 X y f(y) L [x L ](1 x) 1 =[y L](1 y f(y))2(1 y f(y)) 1 =[y L](1 y f(y)) Q 1 L ,L = [x L y L]xy 1 xy f(y) + 1 y(1 y f(y)). The rest of the proof is again almost identical to the previous case when µ < . When µ = , Q 1 0,L = δ0(L) δ1(L) + δ2(L) and (Q1:,1:) 1 L ,L = [x L y L](1 y)2 1 xy = δL (L) 2δL +1(L) + δL +2(L). Thus kla again has discrete masses by Equation 14. Kernels for Biological Sequences Appendix I. Computation of thick tailed alignment kernels We now describe dynamic programming procedures that can be used to calculate the thick tailed alignment kernels described in Examples 9 and 10. We will deal with both examples as special cases of a single general dynamic programming algorithm. Let ℓ(b, b ), taking values in 1, 0, be a function for counting occurrences of a certain type of letter pairs. Call pairs (b, b ) ℓ-type if ℓ(b, b ) = 1. Define M(i, j, l) = (l) X c 1 Y i=0 k I(X(2i) (:i) , Y (2i) (:j) )ks(X(2i+1) (:i) , Y (2i+1) (:j) ) Where the sum is over c and over partitions X(0) (:i)+ +X(2c+1) (:i) = X(:i), Y (0) (:j)+ +Y (2c+1) (:j) = Y(:j) for which l = Pc i=1 ℓ(X(2i+1) (:i) , Y (2i+1) (:J) ). Thus, M(i, j, l) sums over all alignments between X(:i) and Y(:j) with l comparisons that are ℓ-type and which end in a match between X(i 1) and Y(j 1). Let IX(i, j, l) be defined similarly as the sum of alignment scores of X(:i) and Y(:j) that have l comparisons that are ℓ-type and end in an insertion for X but not Y . Finally let IY (i, j, l) be defined similarly as the sum of alignment scores of X(:i) and Y(:j) that have l comparisons that are ℓ-type and end in an insertion for Y . Call the sum over all alignments of X(:i) and Y(:j) with l comparisons that are ℓ-type, J(i, j, l) = M(i, j, l) + IX(i, j, l) + IY (i, j, l). We can now compute M, IX and IY with dynamic programming. We initialize at, M(i, 0, l) = M(0, j, l) = 0 for l, i, j M(0, 0, 0) = 1 IX(0, j, l) = IY (i, 0, l) = 0 for l, i, j. 
The update equations are,

$$
\begin{aligned}
M(i, j, l) ={}& \mathbf{1}(l > 0)\,\ell(X_{(i-1)}, Y_{(j-1)})\, k_s(X_{(i-1)}, Y_{(j-1)})\, J(i-1, j-1, l-1) \\
&+ \bigl(1 - \ell(X_{(i-1)}, Y_{(j-1)})\bigr)\, k_s(X_{(i-1)}, Y_{(j-1)})\, J(i-1, j-1, l), \\
I_X(i, j, l) ={}& e^{-\mu - \bar{\mu}}\, M(i-1, j, l) + e^{-\mu}\, I_X(i-1, j, l), \\
I_Y(i, j, l) ={}& e^{-\mu - \bar{\mu}}\, M(i, j-1, l) + e^{-\mu - \bar{\mu}}\, I_X(i, j-1, l) + e^{-\mu}\, I_Y(i, j-1, l).
\end{aligned}
$$

(We use the arbitrary convention that, between any two adjacent match positions, insertions in $Y$ come before insertions in $X$; thus $I_Y$ does not appear in the update equation for $I_X$.) We can then define $R(L) = J(|X|, |Y|, L)$ as the sum over all alignments of $X$ and $Y$ with $L$ comparisons that are $\ell$-type. The dynamic programming algorithm calculates $R(L)$ with $O(|X|\,|Y|\,(|X| \wedge |Y|))$ operations.

Now consider the choice $k_s(b, b') = 1$ and $\ell(b, b') = \mathbf{1}(b \neq b')$. In this case, $R(L)$ sums over all alignments where $L$ of the matched positions have mismatched letters, i.e. where the Hamming distance between the matched positions is $L$. Thus, the kernel described in Example 9 can be computed as $\sum_{L=0}^{\infty} (C + L)^{-\beta} R(L)$.

Next set $k_s(b, b') = \delta_b(b')$ and $\ell(b, b') = 1$, so $R(L)$ sums over all alignments where there are $L$ matched positions and all have matched letters. Recall the definition of $u_V$ as in Example 10. For sequences $V, X, Y$, the product $u_V(X)\, u_V(Y)$ is the sum over all alignments of $X$ and $Y$ such that the matched positions are $V$ in both $X$ and $Y$, so $R(L) = \sum_{|V| = L} u_V(X)\, u_V(Y)$. Thus, the kernel described in Example 10 can be computed as $\sum_{L=0}^{\infty} \bigl(2(|X| + |Y|) - L\bigr)^{-\beta} R(L)$.
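The recursions above translate directly into code. The sketch below is a minimal NumPy implementation of this dynamic program: the function name `thick_tailed_counts` and the example parameter values (`mu`, `mu_bar`, `C`, `beta`) are illustrative assumptions, not the paper's reference implementation. The example call mirrors the Example 9 choice $k_s \equiv 1$ and $\ell(b, b') = \mathbf{1}(b \neq b')$, so $R(L)$ groups alignments by the Hamming distance between their matched positions.

```python
import numpy as np

def thick_tailed_counts(X, Y, ks, ell, mu, mu_bar):
    # Sum of alignment weights R(L), grouped by the number L of ell-type matched
    # positions, following the dynamic program described above.
    # ks(a, b): substitution kernel on letters; ell(a, b) in {0, 1} flags ell-type
    # pairs; exp(-mu) weights each inserted letter and exp(-mu_bar) each gap run.
    nx, ny = len(X), len(Y)
    lmax = min(nx, ny) + 1
    M = np.zeros((nx + 1, ny + 1, lmax))
    IX = np.zeros((nx + 1, ny + 1, lmax))
    IY = np.zeros((nx + 1, ny + 1, lmax))
    M[0, 0, 0] = 1.0
    J = lambda i, j, l: M[i, j, l] + IX[i, j, l] + IY[i, j, l]
    for i in range(nx + 1):
        for j in range(ny + 1):
            for l in range(lmax):
                if i > 0 and j > 0:
                    a, b = X[i - 1], Y[j - 1]
                    if ell(a, b):          # ell-type match consumes one unit of l
                        if l > 0:
                            M[i, j, l] = ks(a, b) * J(i - 1, j - 1, l - 1)
                    else:                  # non-ell-type match leaves l unchanged
                        M[i, j, l] = ks(a, b) * J(i - 1, j - 1, l)
                if i > 0:
                    IX[i, j, l] = (np.exp(-mu - mu_bar) * M[i - 1, j, l]
                                   + np.exp(-mu) * IX[i - 1, j, l])
                if j > 0:
                    IY[i, j, l] = (np.exp(-mu - mu_bar) * M[i, j - 1, l]
                                   + np.exp(-mu - mu_bar) * IX[i, j - 1, l]
                                   + np.exp(-mu) * IY[i, j - 1, l])
    return np.array([J(nx, ny, l) for l in range(lmax)])

# Example 9-style usage (mismatch counting); C, beta, mu, mu_bar are illustrative.
R = thick_tailed_counts("ATGC", "ACC",
                        ks=lambda a, b: 1.0,
                        ell=lambda a, b: a != b,
                        mu=1.0, mu_bar=0.5)
C, beta = 1.0, 2.0
k_value = sum((C + L) ** (-beta) * R[L] for L in range(len(R)))
print(R, k_value)
```

Swapping in $k_s(b, b') = \delta_b(b')$ and $\ell \equiv 1$ gives the Example 10 counts in the same way.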
Appendix J. Flexibility of embedding kernels

Proposition 41 (proof of Propositions 27 and 29) Assume $k(z, z') = \Psi(z - z')$ is a translation invariant kernel and that $\Psi$ is a positive continuous function on $\mathbb{R}^D$ that has a strictly positive Fourier transform. Then the embedding kernel $k$ is characteristic and $C_0$-universal if and only if $F$ is injective. Furthermore, $k$ metrizes $P(S)$ if and only if $k$ has discrete masses, which occurs if and only if $F(S)$ has no accumulation points, that is, there is no sequence $X \in S$ such that $F(X)$ is in the closure of $F(S \setminus \{X\})$.

Proof First we prove that $k$ metrizes $P(S)$ only if $F$ is injective and $F(S)$ has no accumulation points. Say $F(X_n) \to F(X)$ for a sequence of points with $X_n \neq X$ for all $n$. Then $k(X_n, X_n) = \Psi(0) = k(X, X)$ and $k(X_n, Y) = \Psi(F(X_n) - F(Y)) \to \Psi(F(X) - F(Y)) = k(X, Y)$ for all $Y$. Thus
$$\Bigl\| \int d\delta_{X_n}(Y)\, k_Y - \int d\delta_X(Y)\, k_Y \Bigr\|_k^2 = \| k_{X_n} - k_X \|_k^2 = 2k(X, X) - 2k(X_n, X) \to 0.$$
So $k$ does not metrize $P(S)$, as $\delta_{X_n} \not\to \delta_X$.

Now we show $k$ has discrete masses if $F$ is injective and $F(S)$ has no accumulation points. Say $k$ does not have discrete masses, so that, by Proposition 9, $\sum_{m=1}^{M_n} \alpha_{n,m} k_{X_{n,m}} \to k_X$ in $\mathcal{H}_k$ for $X_{n,m} \neq X$. Call $\mu_n = \sum_{m=1}^{M_n} \alpha_{n,m} \delta_{F(X_{n,m})} - \delta_{F(X)}$. Defining $\hat{\Psi}$ and $\hat{\mu}_n$ as the Fourier transforms of $\Psi$ and $\mu_n$, we have, as in Sriperumbudur et al. (2011), that
$$\int d\xi\, |\hat{\mu}_n(\xi)|^2\, \hat{\Psi}(\xi) = \int\!\!\int d\mu_n(x)\, d\mu_n(y)\, \Psi(x - y) = \Bigl\| \int d\mu_n(X)\, k_X \Bigr\|_k^2 \to 0.$$
Thus there is a subsequence $(\hat{\mu}_{n_k})_k$ such that $\hat{\mu}_{n_k} \to 0$ pointwise almost everywhere with respect to the measure $\hat{\Psi}(\xi) d\xi$. Since $\hat{\Psi}$ is strictly positive, $\hat{\mu}_{n_k} \to 0$ pointwise almost everywhere with respect to the Lebesgue measure. This implies that $\mu_{n_k} \to 0$ in distribution. This can happen only if there is a sequence $F(X_{n_k, m_{n_k}}) \to F(X)$, i.e. $F(S)$ has an accumulation point.

Finally we show $k$ is characteristic and universal if and only if $F$ is injective. First, $k$ is not characteristic if $F$ is not injective, since this implies there are two sequences with the same kernel embedding. Now say $F$ is injective and $\nu$ is a finite signed measure with $\int k_X\, d\nu(X) = 0$. Define $\mu(C) = \nu(F^{-1}(C))$ for all sets $C \subseteq \mathbb{R}^D$. Then by the logic above, $\int d\xi\, |\hat{\mu}(\xi)|^2\, \hat{\Psi}(\xi) = \bigl\| \int k_X\, d\nu(X) \bigr\|_k^2 = 0$. Thus $\hat{\mu} = 0$, so $\mu = 0$. Since $F$ is injective, this implies that $\nu = 0$. Thus the map $\nu \mapsto \int d\nu(X)\, k_X$ is injective for all signed measures. This is equivalent to $C_0$-universality by Proposition 2 of Sriperumbudur et al. (2011). This also means that the mapping is injective for all probability distributions, so $k$ is characteristic.

Appendix K. Sparse representation learning can yield inflexible embedding kernels

Consider learning a sequence representation using a matrix factorization model with sparsity-encouraging regularization. For example, consider the following objective,
$$\sum_{i=1}^{N} \Bigl\| X_i - \sum_{d} W_d Z_{id} - B \Bigr\|_2^2 + \lambda \sum_{d, l} \sum_{b \in B} |W_{dlb}| + \beta \sum_{i=1}^{N} \| Z_i \|_2^2. \qquad (16)$$
Here, as in the position-wise comparison kernel (Section 6), we add an infinite tail of stop symbols to each sequence $X \in S$. We treat sequences as one-hot encodings. The vectors $Z_i \in \mathbb{R}^D$ are the representations of the training data, and are mapped linearly, with slope $W$ and offset $B$, to one-hot encoded sequence space. There is $\ell_2$ regularization on the embedding $Z$, with weight $\beta$, and $\ell_1$ regularization on $W$, with weight $\lambda$. This lasso regularization on $W$ encourages each of its entries to be zero, and can make representations more interpretable, in the sense that each dimension of $Z$ only affects a small number of sequence positions. Once the model has been trained, we can use it to define an embedding function
$$F(X) = \operatorname*{argmin}_{Z \in \mathbb{R}^D} \Bigl\| X - \sum_{d} \hat{W}_d Z_d - \hat{B} \Bigr\|_2^2 + \beta \| Z \|_2^2,$$
where $\hat{W}$ and $\hat{B}$ are the parameter values estimated from the training data.

On a finite training data set, there are likely to be positions $l$ at which $\hat{W}$ has more than one zeroed-out letter, i.e. for some $b \neq b'$ and all $d \in \{1, \ldots, D\}$ we have $\hat{W}_{dlb} = \hat{W}_{dlb'} = 0$. In this case, a sequence $\tilde{X}$ with $\tilde{X}_{lb} = 1$ will map to the same representation as another sequence $\tilde{X}'$ that is identical to $\tilde{X}$ except with a different letter at position $l$, namely $\tilde{X}'_{lb'} = 1$. In other words, $F(\tilde{X}) = F(\tilde{X}')$ for $\tilde{X} \neq \tilde{X}'$, and so $F$ is not injective, and an embedding kernel based on $F$ will be neither universal nor characteristic.

Appendix L. Scaled random embeddings

Proposition 42 (proof of Proposition 31) Consider an initial embedding $\bar{F}$ where each $\bar{F}(X)$ for $X \in S$ is drawn from the uniform distribution on the unit ball $\{x \in \mathbb{R}^D \mid \|x\| \leq 1\}$. Then, a kernel using the scaled embedding $F(X) = |B|^{(1+\epsilon)|X|/D}\, \bar{F}(X)$, for $\epsilon > 0$, metrizes $P(S)$ almost surely.

Proof The probability that $F$ is injective is 1, following the same logic as in the proof of Proposition 28. Now, for any $N$ we will show that with probability 1, $\{X \mid \|F(X)\| \leq N\}$ is finite. This will imply that for any sequence of distinct sequences $X_1, X_2, \ldots$, we must have $\|F(X_n)\| \to \infty$ as $n \to \infty$, and in particular the embedding $F(S)$ has no accumulation points. Now, $p(\|F(X)\| \leq N) = p(\|\bar{F}(X)\| \leq N / |B|^{(1+\epsilon)|X|/D})$. If $|B|^{-(1+\epsilon)|X|/D} N \geq 1$ then this quantity is 1. Otherwise, calling $V$ the Lebesgue measure and $B(1)$ the unit ball $\{x \in \mathbb{R}^D \mid \|x\| \leq 1\}$, then
$$p(\|F(X)\| \leq N) = \frac{V\bigl(|B|^{-(1+\epsilon)|X|/D} N\, B(1)\bigr)}{V(B(1))} = |B|^{-(1+\epsilon)|X|} N^D.$$
Thus,
$$\sum_{X \in S} p(\|F(X)\| \leq N) \leq N^D \sum_{X \in S} |B|^{-(1+\epsilon)|X|} = N^D \sum_{L=0}^{\infty} |B|^{-L\epsilon} < \infty.$$
So by the Borel-Cantelli lemma, with probability 1, $F(S)$ has no accumulation points.

References

10x Genomics. A new way of exploring immunity: linking highly multiplexed antigen recognition to immune repertoire and phenotype. 10x Genomics, 2022.

Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol., 33(8):831–838, 2015.
Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods, 16(12):1315–1322, December 2019.

Alan N Amin, Eli N Weinstein, and Debora S Marks. A generative nonparametric Bayesian model for whole genomes. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), December 2021.

Miguel A Arcones and Evarist Giné. On the bootstrap of U and V statistics. Ann. Stat., 20(2):655–674, 1992.

Balandat, Karrer, Jiang, Daulton, Letham, Wilson, and Bakshy. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. Adv. Neural Inf. Process. Syst., 2020.

Luis A Barrera, Anastasia Vedenko, Jesse V Kurland, Julia M Rogers, Stephen S Gisselbrecht, Elizabeth J Rossin, Jaie Woodard, Luca Mariani, Kian Hong Kock, Sachi Inukai, Trevor Siggers, Leila Shokri, Raluca Gordân, Nidhi Sahni, Chris Cotsapas, Tong Hao, Song Yi, Manolis Kellis, Mark J Daly, Marc Vidal, David E Hill, and Martha L Bulyk. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science, 351(6280):1450–1454, March 2016.

Surojit Biswas, Grigory Khimulya, Ethan C Alley, Kevin M Esvelt, and George M Church. Low-N protein engineering with data-efficient deep learning. Nat. Methods, 18(4):389–396, 2021.

Dexiong Chen, Laurent Jacob, and Julien Mairal. Recurrent kernel networks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 13453–13464. Curran Associates Inc., Red Hook, NY, USA, December 2019.

Andreas Christmann and Ingo Steinwart. Universal kernels on Non-Standard input spaces. In J Lafferty, C Williams, J Shawe-Taylor, R Zemel, and A Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010.

Bo Dai, Hanjun Dai, Arthur Gretton, Le Song, Dale Schuurmans, and Niao He. Kernel exponential family estimation via doubly dual embedding. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2321–2330. PMLR, 2019.

Charita Dellaporta, Jeremias Knoblauch, Theodoros Damoulas, and François-Xavier Briol. Robust Bayesian inference for simulator-based models via the MMD posterior bootstrap. In International Conference on Artificial Intelligence and Statistics (AISTATS), February 2022.

Nicki Skafte Detlefsen, Søren Hauberg, and Wouter Boomsma. Learning meaningful representations of protein sequences. Nat. Commun., 13(1):1914, April 2022.

Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, April 1998.

Jonathan Frazer, Pascal Notin, Mafalda Dias, Aidan Gomez, Joseph K Min, Kelly Brock, Yarin Gal, and Debora S Marks. Disease variant prediction with deep generative models of evolutionary data. Nature, 599(7883):91–95, 2021.

Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Neural Information Processing Systems (NeurIPS), 2007.

Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Bharath K Sriperumbudur. Characteristic kernels on groups and semigroups. Adv. Neural Inf. Process. Syst., 21, 2008.

Futoshi Futami, Zhenghang Cui, Issei Sato, and Masashi Sugiyama. Bayesian posterior approximation via greedy particle optimization. AAAI, 33(01):3606–3613, July 2019.
Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pages 129–143. Springer Berlin Heidelberg, 2003.

Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1292–1301. PMLR, 2017.

Arthur Gretton, Kenji Fukumizu, Choon Teo, Le Song, Bernhard Schölkopf, and Alex Smola. A kernel statistical test of independence. Adv. Neural Inf. Process. Syst., 20, 2007.

Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, 2012.

Ben Hambly and Terry Lyons. Uniqueness for the signature of a path of bounded variation and the reduced path group. Ann. Math., 171(1):109–167, March 2010.

David Haussler. Convolution kernels on discrete structures. Technical report, UCSC-CRL, 1999.

Ferenc Huszár and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, pages 377–386, Arlington, Virginia, USA, August 2012. AUAI Press.

T Jaakkola, M Diekhans, and D Haussler. A discriminative framework for detecting remote protein homologies. J. Comput. Biol., 7(1-2):95–114, February 2000.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: convergence and generalization in neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 8580–8589, Red Hook, NY, USA, December 2018. Curran Associates Inc.

P Jorgensen and F Tian. Discrete reproducing kernel Hilbert spaces: sampling and distribution of Dirac-masses. J. Mach. Learn. Res., 2015.

Franz J Kiraly and Harald Oberhauser. Kernels for sequentially ordered data. J. Mach. Learn. Res., 20(31):1–45, 2019.

Nils M Kriege, Fredrik D Johansson, and Christopher Morris. A survey on graph kernels. Applied Network Science, 5(1):1–42, January 2020.

Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. In International Conference on Machine Learning, pages 2024–2033. PMLR, July 2017.

Christina Leslie, Eleazar Eskin, and William Stafford Noble. The spectrum kernel: a string kernel for SVM protein classification. Pac. Symp. Biocomput., pages 564–575, 2002.

Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: towards deeper understanding of moment matching network. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 2200–2210, Red Hook, NY, USA, December 2017. Curran Associates Inc.

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.

Takuo Matsubara, Jeremias Knoblauch, François-Xavier Briol, and Chris J Oates. Robust generalised Bayesian inference for intractable likelihoods. J. R. Stat. Soc. Series B Stat. Methodol., 84(3):997–1022, July 2022.

Radford M Neal. Priors for infinite networks. In Radford M Neal, editor, Bayesian Learning for Neural Networks, pages 29–53. Springer New York, New York, NY, 1996.
Pascal Notin, José Miguel Hernández-Lobato, and Y Gal. Improving black-box optimization in VAE latent space using decoder uncertainty. Adv. Neural Inf. Process. Syst., 2021.

Pascal Notin, Nathan Rollins, Yarin Gal, Chris Sander, and Debora Marks. Machine learning for functional protein design. Nat. Biotechnol., 42(2):216–228, February 2024.

Mijung Park, Wittawat Jitkrittum, and Dino Sejdinovic. K2-ABC: Approximate Bayesian computation with kernel embeddings. In Arthur Gretton and Christian C Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 398–407, Cadiz, Spain, 2016. PMLR.

Roman Pogodin, Namrata Deka, Yazhe Li, Danica J Sutherland, Victor Veitch, and Arthur Gretton. Efficient conditionally invariant representation learning. In International Conference on Learning Representations (ICLR), September 2022.

Luc Pronzato and Anatoly Zhigljavsky. Bayesian quadrature, energy minimization, and space-filling design. SIAM/ASA J. Uncertainty Quantification, 8(3):959–1011, January 2020.

Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Xi Chen, John Canny, Pieter Abbeel, and Yun S Song. Evaluating protein transfer learning with TAPE. In Advances in Neural Information Processing Systems, volume 32, pages 9689–9701, 2019.

G Rätsch, S Sonnenburg, and B Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21 Suppl 1:i369–77, June 2005.

A M Rush. Torch-Struct: Deep structured prediction library. arXiv preprint arXiv:2002.00876, 2020.

Hiroto Saigo, Jean-Philippe Vert, Nobuhisa Ueda, and Tatsuya Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11):1682–1689, July 2004.

Gabriele Schweikert, Gunnar Rätsch, Christian Widmer, and Bernhard Schölkopf. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. Adv. Neural Inf. Process. Syst., 21, 2008.

Bernhard Schölkopf and Alexander J Smola. Designing kernels. In Learning with Kernels, chapter 13. The MIT Press, Massachusetts Institute of Technology, December 2021.

Louis Shapiro, Renzo Sprugnoli, Paul Barry, Gi-Sang Cheon, Tian-Xiao He, Donatella Merlini, and Weiping Wang. The Riordan Group and Applications. Springer International Publishing, 2022.

Louis W Shapiro, Seyoum Getu, Wen-Jin Woan, and Leon C Woodson. The Riordan group. Discrete Appl. Math., 34(1-3):229–239, November 1991.

Rahul Singh, Liyuan Xu, and Arthur Gretton. Kernel methods for multistage causal inference: Mediation analysis and dynamic treatment effects. November 2021.

Rahul Singh, Liyuan Xu, and Arthur Gretton. Kernel methods for causal functions: Dose, heterogeneous, and incremental response curves. Biometrika, 111(2):497–516, 2023.

Sören Sonnenburg, Gunnar Rätsch, and Konrad Rieck. Large-scale learning with string kernels. In Large-Scale Kernel Machines. The MIT Press, 2007.

Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Gert R G Lanckriet, and Bernhard Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. Advances in Neural Information Processing Systems 22 - Proceedings of the 2009 Conference, pages 1750–1758, 2009.

Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R G Lanckriet. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11(50):1517–1561, 2010.
Bharath K Sriperumbudur, Kenji Fukumizu, and Gert R G Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res., 12(70):2389–2410, 2011.

Samuel Stanton, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, and Andrew Gordon Wilson. Accelerating Bayesian optimization for biological sequence design with denoising autoencoders. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 20459–20478. PMLR, 2022.

Nora C Toussaint, Christian Widmer, Oliver Kohlbacher, and Gunnar Rätsch. Exploiting physico-chemical properties in string kernels. BMC Bioinformatics, 11 Suppl 8:S7, October 2010.

Koji Tsuda, Taishin Kin, and Kiyoshi Asai. Marginalized kernels for biological sequences. Bioinformatics, 18 Suppl 1:S268–75, 2002.

Jean-Philippe Vert, Hiroto Saigo, and Tatsuya Akutsu. Local alignment kernels for biological sequences. In Kernel Methods in Computational Biology, pages 131–153. MIT Press, 2004.

Chris Watkins. Dynamic alignment kernels. In Advances in Large-Margin Classifiers, pages 39–50. The MIT Press, September 2000.

Eli N Weinstein. Generative Statistical Methods for Biological Sequences. PhD thesis, Harvard University, Ann Arbor, United States, 2022.

Eli N Weinstein and Debora S Marks. A structured observation distribution for generative biological sequence prediction and forecasting. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 11068–11079. PMLR, 2021.

Eli N Weinstein and Jeffrey W Miller. Bayesian data selection. J. Mach. Learn. Res., 24(23):1–72, 2023.

Jason Weston, Bernhard Schölkopf, Eleazar Eskin, Christina Leslie, and William Stafford Noble. Dealing with large diagonals in kernel matrices. Ann. Inst. Stat. Math., 55(2):391–408, June 2003.

Christopher Williams. Computing with infinite networks. Adv. Neural Inf. Process. Syst., 9, 1996.

Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In Arthur Gretton and Christian C Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 370–378, Cadiz, Spain, 2016. PMLR.

Kevin K Yang, Zachary Wu, Claire N Bedbrook, and Frances H Arnold. Learned protein embeddings for machine learning. Bioinformatics, 34(15):2642–2648, August 2018.

Kevin K Yang, Zachary Wu, and Frances H Arnold. Machine-learning-guided directed evolution for protein engineering. Nat. Methods, 16(8):687–694, August 2019.