# kernels_for_sequentially_ordered_data__3bcf8849.pdf Journal of Machine Learning Research 20 (2019) 1-45 Submitted 6/16; Revised 2/18; Published 2/19 Kernels for Sequentially Ordered Data Franz J. Kir aly f.kiraly@ucl.ac.uk Department of Statistical Science University College London London WC1E 6BT, United Kingdom Harald Oberhauser oberhauser@maths.ox.ac.uk Mathematical Institute University of Oxford Oxford OX2 6GG, United Kingdom Editor: Le Song We present a novel framework for learning with sequential data of any kind, such as multivariate time series, strings, or sequences of graphs. The main result is a sequentialization that transforms any kernel on a given domain into a kernel for sequences in that domain. This procedure preserves properties such as positive definiteness, the associated kernel feature map is an ordered variant of sample (cross-)moments, and this sequentialized kernel is consistent in the sense that it converges to a kernel for paths if sequences converge to paths (by discretization). Further, classical kernels for sequences arise as special cases of this method. We use dynamic programming and low-rank techniques for tensors to provide efficient algorithms to compute this sequentialized kernel. Keywords: Sequential data, kernels, signature, ordered moments, signature kernels 1. Introduction Sequentially ordered data are ubiquitous in modern science, occurring as time series, location series, or, more generally, sequentially observed samples of structured objects. Two stylised facts make learning with sequential data an ongoing challenge: (A) Sequential data is very diverse, e.g. it includes sequences of different length of scalars, letters, images, graphs (in time series, text, video, network evolution) or even heterogeneous combinations of these. This diversity is usually addressed by ad-hoc approaches, extensive pre-processing and manual extraction of features, thus specific to the data at hand. (B) Sequential data is often large, with sequences easily obtaining the length of hundreds, thousands, millions. Hence, the data sets quickly become very huge, and with them computational cost. In this paper, we present a novel approach to learning with sequential data based on joining ideas from stochastic analysis and kernel learning, addressing the points above: c 2019 Franz J. Kir aly and Harald Oberhauser. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/16-314.html. Kir aly and Oberhauser (a) The (discretized) signature is a canonical feature map for sequences. It can intuitively be described as an ordered version of sample moments. It completely describes a sequence and makes sequences of different size and length comparable. (b) The kernel trick applied to signature features avoids the combinatorial explosion of coordinates inherent to the signature feature map (in analogy to the classic polynomial kernel in Rd). The main results of this article can be described as follows: if X is a topological space ( structured objects ) and we denote with X+ = S ℓ 1 Xℓthe set of arbitrary length sequences in X ( sequences of structured objects ) then we construct a map that takes a kernel k : X X R to a kernel k+ : X+ X+ R. We call k+ the discretized signature kernel (or sequentialization) over k and provide efficient algorithms to evaluate k+. This transformation is canonical in the sense that it preserves important properties of k, such as positive definiteness and it is consistent in the sense that k(x, y) for x, y X+ converges to an inner product of a feature map for paths that is classic in stochastic analysis (the path signature) if the sequences x, y converge to paths (from discrete to continuous time). Classical kernels for sequences such as string, global alignment or relation-convolution kernels can be identified as special cases of the discretized signature kernel k+; for example string kernels arise by taking X to be a finite set of letters . More interestingly, even for X = R this yields new kernels for time series by the sequentialization of classic non-linear kernels such as Gaussian, Laplace, etc. From a methodological point of view, this gives a canonical way to transform static features (resp. kernels k) for structured objects into features (resp. kernels k+) for evolving structured objects. 1.1. Kernels for paths We first discuss the situation for path-valued observations (continuous time) rather than sequence-valued observations (discrete time) the latter then follows by discretizing time. That is, if we denote with PX X[0,1] a sufficiently regular subset of the space of paths that evolve in X (without loss of generality the paths are parameterized by [0, 1]), our aim is to transform a kernel k : X X R into a kernel k : PX PX R. Therefore, first recall that X u 7 ku := k(u, ) H RX can be thought of as the (canonical) feature map induced by k. It is then natural to proceed as follows: 1. Lift a path evolving in data space X to a path evolving in feature space H. We lift the path x that evolves in data space X to a path1 kx that evolves in the feature space H via the kernel k. That is, we map x = (x(t))t [0,1] PX to kx = kx(t) where PH denotes a sufficiently regular subset of H[0,1]. 1. Note the slight abuse of notation that ku H if u X and kx PH if x PX. Kernels for sequentially ordered data 2. Signature features for paths in H. We use a feature map that is well-known in stochastic analysis for paths that evolve in linear spaces such as H. This map is called the signature map S. It maps a path h PH into a series of tensors, h 7 S (h) Y By applying this map to the lifted path kx PH, we get the feature map PX Q m 0 H m, defined as x 7 S(kx). 3. Signature kernels. Taking inner products yields the kernel k : PX PX R defined as inner product k (x, y) = S(kx), S(ky) . It turns out, that k has a recursive structure that allows for an efficient and robust calculation even though kx, ky are paths evolving in the (in general infinitedimensional) state space H. Below we give more details about the last two points of this construction. 1.2. The signature map The signature S maps a (sufficiently regular) path h PH H[0,1] into a series of tensors, h 7 S(h) = (Sm(h))m 0 Y where by convention H 0 := R and S0(h) := 1. The remaining terms Sm(h) are defined as follows: the first degree, S1(h), of the signature is simply the average change, S1(h) := Et[ h(t)] H, where h denotes the derivative of h and the expectation is taken over t sampled uniformly from [0, 1]. The second degree of the signature, S2(h), is the (non-centred) covariance of changes at two subsequent time points, that is 2Et1s1 t2>t1 κ(ds2, dt2) s3>s2 t3>t2 [. . . ] Finally, the data we have access is to is only a sequence (x(t1), . . . , x(tn)) X+ rather than the whole path x = (x(t))t [0,1] PX, but replacing the integrals by sums immediately gives an explicit recursive formula for a kernel k+ : X+ X+ R that is determined by a Horner-type3 recursive evaluation; especially, one never computes the signature map of a path that evolves in the high/infinite-dimensional Hilber space H directly. This yields a robust and computationally effective formula for a kernel k+ for sequences, that has well-defined behaviour in the scaling limit when sequences approximate paths. We make the above informal derivation rigorous in the following sections, study its properties and extended it to non-smooth paths in Appendix B. 1.4. Prior art and literature review Learning with sequential data is a vast field. Prior art that inspired our present approach can be found in (i) dynamic programming algorithms for sequence comparison in the engineering community, (ii) kernel and Gaussian processes for sequences in the machine learning community, (iii) Rough paths in the stochastic process community. Beyond the above, we are not aware of literature in statistics that deals with sequencevalued data points in a way other than first identifying one-dimensional sequences with real vectors of same size, or even forgetting the sequence structure entirely and replacing the sequences with (order-agnostic) aggregates such as cumulants, quantiles or principal component scores. Below we discuss above three points in more detail: (i) Dynamic programming algorithms for sequence comparison. The earliest use of order information in sequences for learning can probably be found in Sakoe and Chiba (1970) and Sakoe (1979) by using editing or distortion distances. These distances are then employed for classification by maximum similarity/minimum distance principles. Through theoretical appeal and efficient computability, sequence comparison methods, later synonymously called dynamic time warping methods, have become one of the standard methods in comparing sequential data Kruskal (1983); Giorgino (2009). Sequence comparison methods in their original formulation can only be directly applied to 3. Horner s scheme to evaluate a uni-variate polynomial p(x) = Pm i=0 aixi is to write p(x) = a0 + x (a1 + x (a2 + + x (am 1 + xam))) and compute the brackets from the inside to the outside. This needs only n additions and n multiplications which is optimal and in contrast to the n additions and n2+n 2 multiplications needed for the naive evaluation of p(x); further, it is numerically more stable since one never adds floats that live on different scales (such as a1x and amxm). This structural similarity with formula (1) will be the key to our algorithms. Kir aly and Oberhauser relatively simple distance-based learning algorithms. This has been addressed by combining such methods with kernel learning (discussed below). (ii) Kernels and Gaussian processes for sequences. Kernel learning is one of the most popular approaches to make non-linear data of arbitrary kind amenable to classic and scalable linear algorithms and it provides learning theoretical guarantees, see Sch olkopf and Smola (2002); Shawe-Taylor and Cristianini (2004); Cucker and Smale (2002). Kernels for strings, that is, sequences of symbols, were among the first to be considered by Haussler (1999) and fast algorithms were obtained a few years later Lohdi et al. (2002); Leslie and Kuang (2004). Kernels based on the above-mentioned dynamic programming (time warping) approach were developed, for sequences of arbitrary objects leading to so-called alignment kernel, see Bahlmann et al. (2002); Noma (2002); Cuturi et al. (2007); Cuturi (2011). All of the mentioned kernels can be viewed from Haussler s original, visionary relation-convolution kernel framework, Haussler (1999), the existing literature provides no unifying approach to kernels for sequences: for example the relation between string kernels and dynamic time warping/global alignment kernels, or to the classical theory of time series has remained unclear. Further, the only known kernels for sequences of arbitrary objects, the dynamic time warping/global alignment kernels, are in general not positive definite. (iii) Rough paths and stochastic processes. Series of iterated integrals appear in diverse areas such as control theory, combinatorics, homotopy theory, physics (Feynman Dyson Schwinger theory) and more recently probability theory; see (Lyons, 2004, Section Historic papers , p97). This series is treated under various names in the literature like Magnus expansion , time ordered or non-commutative exponential , or the one we follow, namely the signature of a path . Signatures have found many applications in stochastic analysis (rough path theory, regularity structures), but their use in statistics and machine learning is very recent: Papavasiliou et al. (2011) applies it to SDE parameter estimation; Levin et al. (2013); Lyons et al. (2014) applies it to forecasting and classification; Yang et al. (2015) uses signature features as input to deep neural nets. So far, all applications are restricted to paths evolving in low-dimensional linear spaces due to the computational bottleneck given by the combinatorial explosion of signature coordinates. 1.5. Outline Section 2 introduces the signature as a canonical feature map for path-valued data. Section 3 shows that this signature can be kernelized to get a kernel k for paths and studies its properties. Section 4 studies the discretization to turn k into a kernel k+ for sequences. Section 5 shows that classic kernels for sequences arise as special cases of our construction. Section 6 presents effective algorithms for computing k+ using above recursion with low-rank techniques. Section 7 presents a Numpy implementation and basic benchmarks. 1.6. Notation Kernels for sequentially ordered data Spaces, sequences and paths X an arbitrary set X+ X+ := S ℓ 1 Xℓsequences of arbitrary length in X PX a set of continuous paths from [0, 1] to X H the reproducing kernel Hilbert space, H RX, of the kernel k H+ H+ := S ℓ 1 Hℓsequences of arbitrary length in H+ PH a set of continuous bounded variation paths from [0, 1] to H Signature features for paths and sequences in H Q m 0 H m the linear space of series of tensors in H S the signature map S : PH Q m 0 H m S+ the discretized signature map S+ : H+ Q m 0 H m , inner product on the subset of square-summable elements of Q m 0 H m norm given by , for square summable elements of Q m 0 H m k a kernel k : X X R kx with x X the element kx := k(x, ) H RX kx with x X+ the sequence kx := (kxi)i=1,...,ℓ H+ with x = (xi)i=1,...,l kx with x PX the path kx := (kx(t))t [0,1] PH with x = (x(t))t [0,1] k+ a kernel k+ : X+ X+ R, k+(x, y) := S+(kx), S+(ky) k a kernel k : PX PX R, k (x, y) := S(kx), S(ky) Partitions of [0, 1] π a partition of [0, 1], i.e. a collection of points 0 = t1 < < tl = 1 mesh(π) mesh(π) := maxi=1,...,l 1 |ti+1 ti| xπ for x PX xπ := (x(ti))i=1,...,l X+, the path x sampled along π 1.7. Author s contribution Both authors contributed equally, HO is the corresponding author. 2. The signature feature map Mapping paths to series of tensors is well-known technique in stochastic analysis. In this section we give a short introduction and highlight aspects of the signature map that make it a good feature map for paths. 2.1. Bounded variation paths Definition 2.1 Let H a Hilbert space. We denote the set of H-valued paths of bounded variation on [0, 1] starting at the origin by C1([0, 1], H) := {h C ([0, 1], H) : h(0) = 0, h 1 < } Kir aly and Oberhauser where h 1 = supπ Pl 1 i=1 h(ti+1) h(ti) H and the supremum is taken over all finite partitions π = {(ti)i=1,...,l : 0 = t1 < < tl = 1} of [0, 1]. As in the finite-dimensional case, the integral 1 0 y(t) dh(t) H is defined as the limit of Riemann Stieltjes sums, see Appendix A for details. The integral itself is in general not a scalar, but an element of the Hilbert space, y dh H. Definition 2.2 We define the sequence dh m m 0 Q m 0 H m as4 ˆ dh 0 := 1 and ˆ dh m := ˆ 1 0 dh (m 1) dh(t). S : C1 ([0, 1] , H) Y m 0 H m , h 7 ˆ dh m and call S(h) the signature of the path h and denote Sm(h) = dh m. Example 2.3 Let H = R2 and h(t) = (a(t), b(t)) . The first levels of the signature of h are S0(h) = 1, S1(h) = 1 0 da(t) 1 0 db(t) 1 0 t2 0 da(t1) da(t2) 1 0 t2 0 da(t1) db(t2) 1 0 t2 0 db(t1) da(t2) 1 0 t2 0 db(t1) db(t2) The degree m = 3, S3(h) R2 3, has 23-scalar valued coordinates, etc. In general, for a path in Rd we need dm scalars to describe dh m and it is exactly this combinatorial explosion that makes the use of signatures prohibitively expensive already for moderately high-dimensional H and impossible to use for infinite-dimensional H (such as the RKHS associated with most kernels relevant in machine learning). Example 2.4 For the special case of a linear path in H = Rd, i.e. h(t) = t v for a fixed vector v Rd, a direct calculation shows that Sm(h) = v m m! = h(1) m or in more concise notation5, S(h) = exp(v) = exp(h(1) h(0)). This explains why the signature S(h) is also known as time-ordered exponential. This analogy with the exponential function will be key motivation for the estimates in Section 4. 4. Recall that by convention H 0 := R. 5. where one denotes t = (tm)m Q m 0 H m as t = t0 + t1 + t2 + , set exp(t) = P m! and identify v Rd as t = 0 + v + 0 + 0 + . Kernels for sequentially ordered data 2.2. The signature as a canonical feature map The signature S(h) describes a path h by an element in the linear space Q m 0 H m and the following properties make it a good feature map 1. The map h 7 S(h) is continuous, 2. The map h 7 S(h) is injective up to tree-like equivalence6, 3. For any f C(K, R), K C1([0, 1], H) compact, and ϵ > 0 there exists a ℓ L m 0 H m such that sup h K |f (h) ℓ, S (h) | < ϵ. The proof is classic in stochastic analysis and given in Appendix A; the key insight is that the space of linear combinations of iterated integrals forms an algebra. To sum up 1. the signature features are a mathematically faithful representation of the underlying path-valued data h. 2. linear combinations of signature coordinates approximate continuous functions of paths arbitrarily well. In the kernel learning context this is known as universality . 3. the signature features are sequentially ordered analogues to moments. Similar to polynomials, the feature space has the natural grading by m designating the polynomial degree . 3. Signature kernels We come back to our original motivation: we are given a RKHS (H, k) on X and we want to transform this into a RKHS for paths in X. Therefore recall that the reproducing property implies that k : X X R equals k(u, v) = ku, kv H and that one may think of X u 7 ku( ) := k(u, ) H as the (canonical) feature map induced by the kernel k. Now given a path x PX we lift it to a path7 kx PH given as 6. We call h, h tree-like equivalent if the concatenation h h 1 is tree-like. A path z C1([a, b], H) is called tree-like if there exists a continuous map f : [a, b] [0, ) with f (0) = f (T) = 0 and |z (t) z (s)| f (s) + f (t) 2 infu [s,t] f (u) for all s t [a, b]. The key remark is that being tree-like equivalent is a very strong assumption, typically not seen in real data. Even if presented with a tree-like path, adding time as extra coordinate transforms the path to a non-treelike path (that is, working with t 7 (t, h(t)) instead of t 7 h(t)). 7. Recall the slight clash of notation ku H for u X and kx PH if x PX, but the meaning will always be clear from the context. In fact, later we are also going to use kx H+ if x X+. Kir aly and Oberhauser t 7 kx(t) := kx(t)( ) := k(x(t), ). The results of Section 2 suggest S(kx) as features for kx, that is the feature map PX x 7 S(kx) Y m 0 H m. (2) Theorem 1 guarantees that S(kx) captures all the information about x PX that can be obtained by looking at the path x through the kernel k. Most RKHS H in machine learning are infinite dimensional which makes this approach infeasible; even for finite dimensional H we suffer from the combinatorial explosion of the number of signature coordinates, Example 2.3. The main result of this section is that the feature map (2) can be complety kernelized, that is by defining (x, y) 7 k (x, y) = S(kx), S(ky) we get a kernel for paths in X that can be very efficiently evaluated using only k (x(s), y(t)) for s, t [0, 1]. That is, only evaluations of the static kernel k : X X R. We henceforth restrict attention to positive definite kernels on X and a set of paths PX that lift to bounded variation paths. All our arguments extend to unbounded variation such as semi-martingales, diffusion processes, Markov processes or Gaussian processes and we give the needed modifications in Appendix B. Assumption 3.1 Throughout the remainder of this article we assume that k : X X R is continuous and positive definite. We denote with H the associated RKHS. Further, we denote with PX X[0,1] a set of paths in a topological space X and (H, k) a RKHS on X such that {t 7 kx(t) : x PX} C1([0, 1], H). This yields the following kernel, Definition 3.2 We call k :PX PX R , (x, y) 7 S(kx), S(ky) , the signature kernel over k. Theorem 2 The signature kernel over k k : PX PX R, k (x, y) = S (kx) , S (kx) 1. is a positive definite kernel, 2. k (x, y) = P m 0 s1< i1 j2>j1 i2,j2 k(x, y) 1 + P i3>j2 j3>j2 . . . , where we use the notation x = (xi)|x| i=1 , y = (yi)|y| i=1 X+ and i,j k(x, y) := k(xi+1, yj+1) + k(xi, yj) k(xi, yj+1) k(xi+1, yj). Kernels for sequentially ordered data Proof The first point follows, since k+ is an inner product. For the second point note that i kx kxi+1 kxi, i ky kyj+1 kyj and that by the reproducing property i kx, j ky = i,j k(x, y). The identity then follows from multiplying out S+(kx), S+(ky) = Y i (1 + i kx), Y j (1 + j ky) and using the definition of the inner product on H m. The last point follows by rearranging the summation. We can now combine this with Theorem 3 to quantify how fast the kernel k+ for sequences approximates the kernel k for paths as the mesh of partitions gets finer. Corollary 4.7 Let x, y PX, and π = (si)k i=1 with 0 = s1 < < sk = 1 and ρ = (tj)l j=1 with 0 = t1 < < tl = 1. Then, k+(xπ, yρ) k (x, y) 4e x 1+ y 1 2e x (π)+ y 1 2e x 1+ y (ρ). In particular, lim x (π) x 1 y (ρ) y 1 k+(xπ, yρ) = k (x, y) where convergence is uniform of order O maxi x[si,si+1] 1 + maxj y[tj,tj+1 1 Proof It holds that k+(xπ, yρ) k (x, y) = S+(kπ x), S+(kρ y) S(kx), S(ky) = S+(kπ x), S+(kρ y)) S(ky) + S+(kπ x) S(kx), S(ky) . The Cauchy Schwarz inequality implies that | S+(kπ x), S+(kρ y)) S(ky) | S+(kπ x) S+(kρ y) S(ky) . Theorem 3 implies S+(kπ x) S+(kρ y)) S(ky) 2 exp( x 1) exp( y 1 exp( y ρ) and similarly, one obtains S+(kρ y), S+(kπ x) S(kx) 2 exp( y 1) (exp( x 1) exp( x π) . Putting all (in-)equalities together yields the main claim, the convergence statement follows from Theorem 3 (ii). Kir aly and Oberhauser 4.3. Truncation and variations on a theme The final step to get a practically useful kernel for sequences is to truncate the discrete signature features S+(kx) Q m 0 H m at a given degree m 1. In analogy with the classic polynomial features (resp. kernel), this not only serves a computational purpose but is also necessary to avoid overfitting; in applications the truncation degree m will be typically determined by cross-validation. Definition 4.8 Given an integer m 1, we call k+ m : X+ X+ R, k+ m(x, y) = S+(kx), S+(ky) m the discretized signature kernel (or sequentialization) over k truncated at degree m. Here we use the notation kx = (kxi)i H+ for x = (xi)i X+ and , m for the inner product truncated9at m. Corollary 4.9 The discretized signature kernel k+ m over k truncated at degree m, k+ m : X+ X+ R, k+ m(x, y) = S+(kx), S+(ky) m, 1. is a positive definite kernel, 2. k+ m(x, y) = Pm d=1 P 1 i1< i1 1 |y|>j1 1 i1,j1 k(x, y) 1 + (1 + P |x|>im>jm 1 |y|>jm>jm 1 im,jm k(x, y)) where we denote x = (xi)|x| i=1 , y = (yi)|y| i=1 X+ and i,j k(x, y) := k(xi+1, yj+1) + k(xi, yj) k(xi, yj+1) k(xi+1, yj). While above follows as simple corollary from the discussion so far, the identity Point (3) will be the key to an efficient algorithm; it is clearly more efficient than computing first S+(x) and S+(y) and then their inner product; but it is also much more efficient than the identity in Point (2) since it only uses i,j k(x, y) for each i, j once. Remark 4.10 (Variations on a theme) Modifications of the above are possible. For example, (i) One can omit the first differences, that is consider the sequentialization k+ : X+ X+ R; (x, y) 7 1 + X m 1 1 i1< j A[m 1|i , j ] that leads to a blown up computational cost of O(mℓ2 xℓ2 y) at the asymptotically insignificant gain of storing one (ℓx ℓy)-matrix less (the matrix Q). Kernels for sequentially ordered data 6.3. Large scale strategies By using Algorithm 3 one can compute the Gram matrix k+ m(xi, xj) i,j=1,...,n of n sequences x1, . . . , xn X+ in O(n2 ℓ2 m) elementary arithmetic operations. While this is for moderate sizes of n,m and ℓachievable on contemporary desktop computers, it becomes quickly prohibitive for large n, ℓdue to the quadratic growth especially when combined with parameter tuning or cross-validation schemes (as later in our experiments). We present two low-rank techniques to address this: the first is classic and reduces the quadratic cost in n to linear, the second reduces the quadratic cost in ℓto linear. Finally, we combine these two techniques to get O(n ℓ m ρ) computation time with ρ denoting an approximation parameter. 6.3.1. Low-rank methods for the sequence-vs-sequence kernel matrix Since k+ m(xi, xj) i,j=1,...,n is a Gram-matrix, it is directly amenable to large-scale variants of low-rank type. Strategies of this kind include the incomplete Cholesky decomposition, Nystr om approximation, or the inducing point formalism in a Gaussian process framework. For n sequences, all strategies mentioned above require evaluation of at most an (n r) and an (r r) matrix (where r is a meta-parameter), which costs O((r + n) r ℓ2 m) elementary operations and O((r + n) r ℓ2) storage11. Any of these low-rank strategies reduces the complexity of the Gram matrix to be linear in n, but not in ℓsince the strategy is completely independent of how the discretized signature kernel was evaluated at a given pair (xi, xj) X+ X+. For the same reason, any improvements on the cost of evaluating k+ m(xi, xj) will combine with this strategy. 6.3.2. Low-rank methods for the element-vs-element kernel matrix The evaluation k+ m(xi, xj) for a pair (xi, xj) X+ is given by nested partial summations and multiplications applied to the (ℓj ℓj)-matrix (k(xi,r, xj,s))r,s where r = 1, . . . , ℓi, s = 1, . . . , ℓj and we denote xi = (xi,1, . . . , xi,ℓi), xj = (xj,1, . . . , xj,ℓj) X+. The low-rank methods of the previous paragraph cannot be applied naively to this (ℓi ℓj)- matrix: firstly, this matrix is in general non-symmetric; secondly, the summationand multiplication-type operations require in naive form at each recursion step access to the full (ℓi ℓj)-matrix. The first issue is easily addressed by replacing the respective symmetric decomposition with the analogous non-symmetric one. The second issue is much harder to deal with, but below we show that by working exclusively on low-rank factorizations in each recursion step this issue can also be dealt with. Definition 6.8 Let A be an (a b)-matrix. For U an (a r)-matrix, and V a (b r)-matrix, we say that (U, V ) is a low-rank presentation of A, of rank r, if r =1 Ui,r Vj,r With denoting the usual matrix multiplication this is equivalent to A = U V . 11. followed by a slightly modified variant of the learning algorithm itself which usually costs O((r + n) r2) elementary operations and storage (at most) instead of the unmodified variant which usually costs O(n3) (at most). Kir aly and Oberhauser The fact that A has a low-rank presentation of rank r does imply that A is of rank r or less (by equivalence of matrix rank with decomposition rank), but it does not imply that A is of rank exactly r. We state a number of straightforward but computationally useful Lemmas that show how low-rank matrices behave under shifting, summation and cumulative summation, componentwise addition and multiplication. This allows us to replace these operations on matrices in Algorithm 4 by the operations on low-rank matrices. Lemma 6.9 (Low-rank under cumulative summations and shifts) Let (U, V ) be a low-rank representation of rank r for an (a b)-matrix A. Then (i) Define A and A as A i,j = P i i Ai ,j and A i,j = P j j Ai,j . Define U as U i,r = P i i Ui ,r and V as V j,r = P j j Vj ,r. Then (U , V ) is a low-rank representation or rank r of A of rank and (U, V ) is a low-rank representation of rank r of A . (ii) Define A and A as A i,j = Ai+s,j and A i,j = Ai,j+s for s 1. Define U and V as U i,r = Ui+s,r1i+s a and V j,r = Vj+s,r1j+s b. Then (U , V ) is a low-rank representation of A of rank r and (U, V ) is a low rank representation of rank r of A . Proof This follows by spelling out the definition: for (i) i i Ai ,j = X r =1 Ui ,r Vj,r = r =1 U i,r Vj,r ; A i,j = Ai+s,j = r =1 Ui+s,r Vj,r = r =1 U i,r Vj,r . The statements for A follow similarly. Lemma 6.10 (Low rank under addition and multiplication) Let A1, A2 be (a b)- matrices such that (U, V ) is a low-rank representation of A1 of rank r1, and (R, S) is a low-rank representation of A2 of rank r2. Then (i) (U , V ) is a low rank representation of A1 + A2 of rank r1 + r2 where ( Ui,r , if 1 r r1 Ri,r r1 , if 1 + r1 r r1 + r2 , V j,r = ( Vj,r , if 1 r r1 Sj,r r1 , if 1 + r1 r r1 + r2 (ii) ( U, V ) is a low rank representation of12 A1 A2 of rank r1r2 where Ui,r = Ui,a Ri,b and Vj,r = Vj,a Sj,b and a, b are such that r = (a 1) r2 + b 1 with a {1, . . . , r1} and b {1, . . . , r2} 12. We denote the elementwise product of matrices with A B, that is (A B)i,j = Ai,j Bi,j. Kernels for sequentially ordered data Proof Since the (i, j)-entry of the matrix A1+A2 equals Pr1 r=1 Ui,r Vj,r+Pr2 r =1 Ri,r Sj,r and the (i, j)-entry of A1 A2 equals Pr1 r=1 Ui,r Vj,r Pr2 r =1 Ri,r Sj,r , the result follows by substituting the two sums over r {1, . . . , r1} and r {1, . . . , r2} by one sum over {1, . . . , r1+r2} resp. over {1, . . . , r1r2}. In view of Point (ii) of above Lemma we introduce new array notation: Notation 6.11 Let A1 be a k-fold array of size (n1 nk) and A2 be a k-fold array of size (n1 nk 1 n k). Write A1 A2 for the (n1 nk 1 (nk n k)) array with entries A1 A2[i1, . . . , ik] = A1[i1, . . . , ik 1, a]A2[i1, . . . , ik 1, b] where a, b are such that ik = (a 1) n2 + b 1 with a {1, . . . , n1} and b {1, . . . , n2}. Algorithm 4 Evaluation of k+ m, with low-rank speed-up. Input: Sequences x, y X+. A kernel k : X X R. A truncation degree m 1. An integer 1 r min(ℓx, ℓy). Output: An approximation to k+ m(x, y) 1: Set ℓx |x| 1 and ℓy |y| 1 2: Compute a low-rank representation (U, V ) of rank r, approximating the element-vselement matrix/array with entries K[i, j] = i,j k(x, y), i {1, . . . , ℓx}, j {1, . . . , ℓy} 3: Initialize an (m ℓx )-array B and an (m ℓy )-array C (* means that the size may change dynamically) 4: Set B[1| :, :] U and C[1| :, :] V 5: for d = 2 to m do 6: P B[d 1| +1, :] and Q C[d 1| +1, :] 7: Append an (ℓx 1)-array of 1 s to P and append an (ℓy 1)-array of 1 s to Q 8: Set B[d| :, :] (U P)[:, :] and C[d| :, :] (V Q)[:, :] 9: optional: simplify the low-rank presentation (B, C), reducing its rank 10: end for 11: Set R B[m|Σ, :] and S C[m|Σ, :] 12: Return 1 + (R S)[Σ] Algorithm 4 first approximates the element-vs-element (ℓx ℓy)-matrix K = ( i,j k(x, y))i,j by a low rank approximation (U, V ). It then uses the formula for k+ m given in Corollary 4.9 and updates the low-rank representation in each recursion step (starting from the innermost parenthesis): The first time line 6 is called it uses Lemma 6.9 to calculate a low-rank representation (P, Q) of the (ℓx ℓy)-matrix P i >i,j >j i ,j k(x, y) i,j from the low-rank approximation (U, V ) for K = ( i,j k(x, y))i,j; later calls of line 6 do the same but with K replaced by another matrix with low-rank factorization stored in B[d| :, :] and C[d| :, :]. Kir aly and Oberhauser The first time line 7 is called, it calculates a low-rank representation ( P, Q) of the (ℓx ℓy)-matrix (1 + (P Q )i,j)i,j by using (P, Q) and the trivial identity k=1 Pi,k Qj,k = k=1 Pi,k Qj,k where P and Q are given by adding a (r + 1)-th column consisting of 1 s. Note that the rank increases by one by going from (P, Q) to ( P, Q). Later calls do the same but with (P, Q) replaced by another matrix with low-rank factorization stored in B[d| :, :] and C[d| :, :]. Line 8 calculates a low-rank representation ( U, V ) for the elementwise product between the (ℓx ℓy)-matrices K = U V and P Q . By Lemma 6.10 it is given as Ui, r = Ui,a Pi,b and V r,j = Va,j Qb,j. Line 9 is optional, aiming at keeping the low-rank presentation small13. Line 11 uses that i 1,j 1 (R S )i,j = X k Ri,k Sj,k = X i 1 Ri,k)( X (Recall that R S denotes elementwise multiplication of arrays R, S and R S denotes the usual matrix multiplication if R, S are interpreted as matrices). The computational cost of Algorithm 4 is of the same order as the maximum size of B and C. That is, if ρ is the smallest integer such that at any time B requires ℓx m ρ space, and C requires ℓy m ρ space, then the computational complexity of Algorithm 4 is14 O((ℓx + ℓy) ρ m). 6.3.3. Simultaneous low-rank methods Algorithm 4 yields an efficient low-rank speed-up for computing a single entry k+ m(xi, xj). Thus total computational cost for the Gram matrix (k+ m(xi, xj))i,j {1,...,n} is O(n2 ℓ ρ m) and cost quadratic in the number of data points n may be prohibitive on large scale data. We address this by combining both low-rank strategies mentioned before to achieve a further reduction of computational cost from O(n2 ℓ ρ m) to O(n ℓ ρ m). 13. This can be achieved for example via singular value decomposition, sub-sampling, or random projection type techniques. 14. Since one can always choose a low-rank representation of B[m| :, :] and C[m| :, :] of rank min(ℓx, ℓy) or less, the computational complexity is bounded by O(ℓx ℓy m) (which is the complexity of Algorithm 3). Kernels for sequentially ordered data Algorithm 5 Computation of the Gram matrix of k+ m, with (double) low-rank speed-up. Input: Sequences x1, . . . , xn X+. A kernel k : X X R. A truncation degree m. An integer r 1. Output: A low-rank approximation (U, U) of the Gram matrix K = k+ m(xi, xj) i,j=1,...,n. 1: Compute arrays U(i) such that each pair (U(i), U(j)) i, j {1, . . . , n} is a low-rank presentation of rank r for the ((|xi| 1) (|xj| 1))-matrix K(ij) with entries K(ij) a,b = a,b k(xi, xj) 2: Initialize an (n m )-array B (where * means that the sizes may change dynamically) 3: B[i|1| :, :] U(i) for all i {1, . . . , n} 4: for d = 2 to m do 5: Compute P B[: |d 1| +1, :] 6: Set κ, ρ such that (n κ ρ) is the size of P 7: Append an (n κ 1)-array of ones to P 8: B[: |d| :, :] B[: |1| :, :] P[:, :, :] 9: optional: simplify the low-rank presentation encoded in B, reducing its rank 10: end for 11: Compute U B[: |m|Σ, :] 12: Return U Algorithm 5 is obtained as follows: The previous Algorithm 4 transforms for any pair xi, xj X+ a low rank representation (U(ij), V (ij)) of the (ℓi ℓj)-matrix K(ij) = ( a,b k(xi,a, xj,b))a,b into a low-rank representation (R(ij), S(ij)) such that k+(xi, xj) = 1 + X R(ij) (S(ij)) But if U(ij) is independent of j and if V (ij) is independent i, then R(ij) will depend by Algorithm 4 only on i and S(ij) will depend only on j; hence R(ij) = S(ji). If we denote R(i) := R(ij) := S(ji), then in above Algorithm 5, B[i, m, a, r] equals R(i) a,r when the end of the for loop is reached, line 10. k+ m(xi, xj) = 1 + X R(i) (R(j)) a 1 R(i) a,k X b 1 R(j) b,k k=1 Ri,k Rj,k where we denote with r the rank of (R(i), R(j)) and set Ri,k := P a 1 R(i) a,k and Ra,k := 1 if k = r +1. Hence, ( R, R) is a low-rank representation of the (n n)-Gram matrix K = (k+ m(xi, xj))i,j. In Algorithm 5, the array entry U[i, r ] equals Ri,r , line 11. Kir aly and Oberhauser Remark 6.12 Line 1 requires that for each pair of sequences xi = (xi,a)a=1,...,ℓi, xj = (xj,b)b=1,...,ℓj X+ we compute the (ℓi ℓj)-matrix K(ij) with entries K(ij) a,b = k(xi,a, xj,b). That is, the matrices U(i), when row-concatenated, should have low rank15. Jointly low-rank U(i) can be obtained by running a suitable joint diagonalization or singular value decomposition scheme on the matrices K(ij). Remark 6.13 (Fast sequential kernel methods) By Section 5, fast string kernel methods such as the gappy, substitution, or mismatch kernels presented in Leslie and Kuang (2004) may be transferred to general sequential kernels. In general, this amounts to small modification of Algorithm 3; for example, to obtain a gappy variant of the sequential kernel, summation in line 6 of Algorithm 3 over the whole matrix, of quadratic size, is replaced by summation over a linear part of it. 7. Experimental validation We perform two experiments to validate the practical usefulness of the signature kernels: (1) On a real world data set of hand movement classification (eponymous UCI data set Sapsanis et al. (2013)), we show the discretized signature kernel outperforms the best previously reported predictive performance Sapsanis et al. (2013), as well as non-sequential kernel and aggregate baselines. (2) On a real world data set on hand written digit recognition (pendigits), we show that the discretized signature kernel over the Euclidean kernel (= linear use of signature features) achieves only sub-baseline performance. Using the discretized signature kernel over a Gaussian kernel improves prediction accuracy to the baseline region. We emphasize that our experiments do not constitute a systematic benchmark comparison to prior work, only validation that the signature kernel is a practically meaningful concept: experiment (1), shows that the sequentialization of standard kernels can achieve state-of-the-art performance on time series data; experiment (2) shows that even for paths in low dimensions, X = R2, using a non-linear static kernel for the sequentialization outperforms the linear kernel (the latter corresponds to learning with signature features of a 2-dimensional path; the former, corresponds to learning with signature features of the path lifted to the RKHS H of the non-linear kernel). A systematic benchmark comparison is likely to require a larger amount of work, since it would have to include a number of previous methods (multiple variants of the string and general alignment kernels, dynamic time warping, naive use of signatures), for most of which there is no freely available code with interface to a machine learning toolbox, and benchmark methods (order-agnostic baselines such as summary aggregation and chunking; distributional regression; naive baselines). 15. For example, if k is the Euclidean scalar product, U (i) can be taken as the raw data matrix, where rows are different time points and columns are features. Kernels for sequentially ordered data 7.1. Validation and prediction set-up 7.1.1. Prediction tasks In all data sets, samples are multi-variate (time) series. All learning tasks are supervised classification tasks of predicting class labels attached to series of equal length. 7.1.2. Prediction methods For prediction, we use eps-support vector classification (as available in the python/scikitlearn package) on the kernel matrices obtained from the following kernels: (1.a) the Euclidean kernel k(x, y) = x, y . This kernel has no parameters. (1.b) the Gaussian kernel k(x, y) = exp 1 2γ2 x y 2 . This kernel has one parameter, a scaling constant γ. (2.a) the (truncated) discretized signature kernel k+ m over the linear/Euclidean kernel k(x, y) = γ x, y . This kernel has two parameters, a scaling constant γ, and the truncation degree m. (2.b) the (truncated) discretized signature k+ m over the Gaussian kernel k(x, y) = θ exp 1 2γ2 x y 2 . This kernel has three parameters: scaling constants γ and θ, and truncation degree m. (1.a) and (1.b) are considered standard kernels, (2.a) and (2.b) are discretized signature kernels. Note that the kernels (1.a) and (1.b) can only be applied to sequential data samples of equal length which is the case for the data sets considered. Even though (1.a), (1.b) may be applied to sequences of same length, they do not use any information about their ordering: both the Euclidean and the Gaussian kernel are invariant under (joint) permutation of the order of the indexing in the arguments. Another subtlety is that the discretized signature kernels (2.a), (2.b) do use information about the ordering of the sequences, but only for a truncation m 2. For m = 1, the kernel corresponds to choosing the increment/mean aggregate feature (Euclidean) or a type of distributional classification (Gaussian). We will therefore explicitly compare truncation degrees 1 versus 2 and higher, to enable us to make a statement about whether using the order information was beneficial (or not). 7.1.3. Tuning and error estimation In all experiments, we use nested (double) cross-validation for parameter tuning (inner loop) and estimation of error metrics (outer loop). In both instances of cross-validation, we perform uniform 5-fold cross-validation. Unless stated otherwise, parameters are tuned on the tuning grid given in Table 3 (when applicable). Kernel parameters are the same as in the above section prediction mehods . The best parameter is selected by 5-fold cross-validation, as the parameter yielding the minimum test-f1-score, averaged over the five folds. Kir aly and Oberhauser parameter range kernel param. γ 0.01, 0.1, 1 kernel param. θ 0.01, 0.1, 1 truncation degree m 1,2,3 SVC regularizer 0.1, 1, 10, 100, 1000 Table 3: Tuning grid 7.1.4. Error metrics The out-of-sample classification error is reported as precision, recall, and f1-score of out-ofsample prediction on the test fold. Errors measures are aggregated with equal weights on classes and folds. These aggregates are reported in the result tables. 7.2. Experiment: Classifying hand movements We performed classification with the eps-support vector machine (SVC) on the hand movements data set from UCI Sapsanis et al. (2013). The first database in the data set which we considered for this experiment contains, for each of five subjects (two male, three female) 180 samples of hand movement s EMG recordings. Each sample is a time series in two variables (channels) at 3.000 time points. The time series fall into six classes of distinct types of hand movement (spherical, tip, palmar, lateral, cylindrical, hook). For each subject, 30 samples of each class were recorded. Hence, for each subject, there is a total of 180 sequences in X3000 with X = R2. For each of the five subjects, we conducted the classification experiment as described in Section 7.1, comparing prediction via SVC using one of the following kernels: (1.a) the Euclidean kernel, (1.b) the Gaussian kernel, (2.a) the sequentialized Euclidean kernel. For the non-sequential kernels (1.a), (1.b), prediction was performed with and without prior standardization of the data. For the sequential kernel, the tuning grid was considered in two parts: a truncation degree of m = 1, corresponding to mean aggregation, and truncation degrees of m = 2, 3, corresponding to the case where genuine sequence information is used. Further, we use the low-rank speed, Algorithm 5, to compute the Gram matrix. The results are reported in Tables 4 to 8. Jackknife standard errors (pooling the five folds) are all 0.04 or smaller. Baseline performance of an uninformed estimator is 1/6 0.17. One can observe that for all five subjects, SVC with sequential kernel of degree 2 or higher method precision recall f1-score (1.a) linear 0.37 0.38 0.36 (1.a) linear, standardized 0.33 0.32 0.29 (1.b) Gaussian 0.57 0.59 0.56 (1.b) Gaussian, standardized 0.54 0.50 0.50 (2.a) mean aggregation 0.19 0.20 0.18 (2.a) sequential, degree 2 0.87 0.86 0.86 Table 4: female1.mat outperforms SVC using any of the other kernels not using any sequence information. The sequence kernel outperforms the reported methods from the original paper Sapsanis et al. (2013) as well (Figures 11 and 12). Kernels for sequentially ordered data method precision recall f1-score (1.a) linear 0.47 0.39 0.37 (1.a) linear, standardized 0.31 0.28 0.27 (1.b) Gaussian 0.71 0.71 0.70 (1.b) Gaussian, standardized 0.59 0.58 0.56 (2.a) mean aggregation 0.18 0.20 0.18 (2.a) sequential, degree 2 0.94 0.97 0.95 Table 5: female2.mat method precision recall f1-score (1.a) linear 0.48 0.46 0.46 (1.a) linear, standardized 0.47 0.42 0.43 (1.b) Gaussian 0.66 0.64 0.63 (1.b) Gaussian, standardized 0.54 0.51 0.50 (2.a) mean aggregation 0.26 0.23 0.20 (2.a) sequential, degree 2 0.96 0.96 0.96 Table 6: female3.mat method precision recall f1-score (1.a) linear 0.37 0.33 0.33 (1.a) linear, standardized 0.38 0.36 0.36 (1.b) Gaussian 0.59 0.57 0.57 (1.b) Gaussian, standardized 0.53 0.54 0.53 (2.a) mean aggregation 0.20 0.18 0.17 (2.a) sequential, degree 2 0.96 0.96 0.96 Table 7: male1.mat method precision recall f1-score (1.a) linear 0.36 0.33 0.32 (1.a) linear, standardized 0.37 0.29 0.27 (1.b) Gaussian 0.72 0.71 0.70 (1.b) Gaussian, standardized 0.34 0.39 0.35 (2.a) mean aggregation 0.22 0.23 0.20 (2.a) sequential, degree 2 0.93 0.93 0.93 Table 8: male2.mat 7.3. Experiment: Pendigits We performed classification on the pendgits data set from the UCI repository16. It contains 10992 samples of digits between 0 and 9 written by 44 different writers with a digital pen on a tablet. One sample consists of a pair of horizontal and vertical coordinates of sampled at 8 different time points, hence we deal with a sequence in X8 with X = R2. The data set comes with a pre-specified training fold of 7494 samples, and a test fold of 3498 samples. Estimation of the prediction error is performed in this validation split, while tuning is done as described via nested 5-fold cross-validation, inside the pre-specified training fold. We compared prediction via SVC using one of the following three kernels: (2.a) the sequentialized Euclidean kernel, and (2.b) the sequentialized Gaussian kernel. For both, the truncation degree was set to m = 4. The results are reported in Table 9. Jackknife standard errors (pooling the five folds) are all 0.01 or smaller. Baseline performance of an uninformed estimator is 1/10 0.10. 16. https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits Kir aly and Oberhauser method\method precision recall f1-score sequential, linear 0.91 0.90 0.89 sequential, Gaussian 0.97 0.97 0.97 Table 9: Pendigits The quality of the SVC prediction with the sequentialzed linear kernel is comparable to results reported in Diehl (2013). It is outperformed by SVC prediction with the sequentialized Gaussian kernel. The latter performance is similar to the baseline performance of k-nearest neighbors reported in the documentation of the pendigits data set. Acknowledgments FK acknowledges support by the Alan Turing Institute, under EPSRC grant EP/N510129/1. HO is grateful for support by the Oxford-Man Institute of Quantitative finance and would like to thank Ilya Chevyrev for a careful proofreading and helpful remarks and discussions. Appendix A. Signatures, and proofs of Theorem 1 and Theorem 3 A.1. Signatures Definition A.1 Let E be a normed space and denote with L (H, E) the set of continuous linear maps from H to E. Given x C1([0, 1], H) and y C1 ([0, 1] , L (H, E)), the Riemann Stieltjes integral of y over [a, b] [0, 1] is defined as the element in E given as a y dx := lim mesh(π) 0 i=1 y(ti) (x(ti+1) x(ti)) where the limit is taken over all partitions π = {a t1 < < tl b} and mesh(π) := maxi=1,...,l 1 |ti+1 ti|. We also use the shorter notation y dx if the integration domain is clear from the context. See (Lyons, 2004, Theorem 1.16) for a proof of a more general result. Applying the above with E being the subspace of Q m 0 H m of square summable tensors and the linear map being tensor multiplication, gives the series S(x) := (Sm(x))m 0 := ˆ dx m with dx 0 := 1 and dx (m+1) := dx m. A.2. Proof of Theorem 1 This is more or less a folk theorem in rough path theory, but we recall it for readers not familiar with the subject. Point (1) of Theorem 1 follows since S(h) is itself a solution of a linear ODE driven by h and this solution map is continuous, see Lyons (2004). Point (2) is Kernels for sequentially ordered data more involved and we refer to Hambly and Lyons (2010). For the last point, Point (3), the key insight is that the series S(h) behaves like monomials. Theorem 5 (Shuffle product) Let h C1([0, 1], H), then ˆ dh m ˆ dh m = X σ σ ˆ dh (m+m ) . Here the sum is taken over all ordered shuffles OSm,m which are defined as σ : σ permutation of 1, . . . , m + m , σ (1) < < σ (m) , σ (m + 1) < < σ m + m . and σ OSm,m acts on H (m+m ) as σ ei1 eim+m = eσ(i1) eσ(im+m ). The proof of above is given in (Lyons, 2004, Theorem 2.29 (ii)). A direct consequence by applying Stone Weierstrass (in analogy to classic monomials) is that linear functionals of signatures approximate nonlinear functions of paths aribtrary well, thus showing Point (3) of Theorem 1. A.3. Proof of Theorem 3 and Corollaries 4.3 and 4.4 Definition A.2 Denote with ℓ [a,b] the n-simplex over the interval [a, b], that is ℓ [a,b] = {(ti)ℓ i=1 : a = t1 < < tℓ= 1}, and with [a,b] = S ℓ 2 ℓ [a,b]. If the interval [a, b] is clear from the context is clear, then we just write ℓand . The following notation for sequences becomes useful. Notation A.3 For n 1 denote 1. [n] := {1, . . . , n}, 2. sequences in [n] as i = (i1, . . . , il) [n]l and call |i| = l the length of the sequence i, 3. d(i) := max{r : i1 = i2 = = ir} the number of repetitions of the first element in the sequence i where by convention d(i) = 0 if i1 = i2, 4. i! := n1! nk! if i consists of k = |{i1, . . . , il}| different elements in [n] and n1, . . . , nk denote the number of times they occur in i, 5. i [n] if i = (i1, . . . , il) [n]l and i1 < < il, 6. i [n] if i = (i1, . . . , il) [n]l and i1 il, 7. i d [n] if i [n] and no element in the sequence i appears more than than d times. Kir aly and Oberhauser Lemma A.4 Let h C1([a, b], H), π ℓ [a,b] and define hπ = (h(ti))i=1,...,ℓ H+. For any sequence i = (ik)k=1,...,m [ℓ]m with 1 i1 im ℓdenote i := m 1 k=1 [tik, tik+1]. Then Sm(h) S+ m(hπ) = X where the sum is taken over all i = (ik)m k=1 with 1 i1 im ℓthat have at least one repeating index, that is ik = ik+1 for at least one 1 k m 1. Moreover, each term in above sum can be bounded as follows i i h|[ti,ti+1] 1. where we use i! as introduced in Notation A.3. Proof We can decompose into a disjoint union of i over all sequences i, with 1 i1 im ℓ, hence Sm(h) = ˆ dh m = X Split the sum P i into a sum over i that have an repeating element, i.e. ik = ik+1 for at least one k, and a sum P i over indices that do not have repeating elements. For the latter, the integration can be done explicitly and this sum equals X with our usual notation ih = h(ti+1) h(ti). However, this is exactly S+ m(h), hence the first statement follows. For the first sum over sequences i with repeating elements, denote with i1, . . . , ik the distinct indices in i, and n1, . . . , nk the total counts of their respective occurrences. Then i = k j=1 j with j := nj [t ij, t ij+1]nj and it follows that Therefore, we obtain j dh nj H 1 i i h|[ti,ti+1] 1. This is enough to prove our main approximation result: Theorem 6 Let h C1([0, 1], H) let π ℓ. Then for every m 0 Sm(h) S+ m(hπ) H m gm Kernels for sequentially ordered data where (gm) are the coefficients in the Taylor expansion around 0 of the function m=1 gm zm := exp(z h 1) 1 + z h|[ti,ti+1] 1 S(h) S+(hπ) exp( h 1) 1 + h|[ti,ti+1] 1 If π is chosen such that h|[ti,ti+1] 1 = 1 ℓ 1 h 1 for all i = 1, . . . , ℓ 1, then S(h) S+(h) exp( h 1) 1 + h ℓ 1 1 (ℓ 3)! Proof The bound for Sm(h) S+ m(hπ) H m follows from Lemma A.4 by explicitly writing out the coefficient gm. Applying the triangle inequality and using this bound and one obtains S(h) S+(hπ) m=1 Sm(h) S+ m(hπ) H m m=1 gm = g(1) Finally, for the special choice of a uniform partition π g(1) = exp h 1 1 + h 1 to which we apply Euler s approximation theorem, Theorem 7, below using that h|[a,b] 1 + h|[b,c] 1 = h|a,c 1. The first part of Theorem 6 implies Theorem 3, the second part implies Corollary 4.4. For readers convenience we recall Euler s well-known approximation to the exponential function. Theorem 7 Let x R, n N. Then, n exp(x) = g(x, n), with |g(x, n)| exp(x) In particular, it holds that n = exp(x), where convergence is uniform of order O(n 1) on any compact subset of R. Proof All statements follow from the first, which we proceed to prove. By the binomial theorem, it holds that 1 + x Kir aly and Oberhauser From the definition of the binomial coefficient and an elementary computation, one obtains k! + g(x, n, k), where g(x, n, k) xk for k n. For k n, one has xk Putting together all inequalities and using the Taylor expansion of exp yields the claim. To prove Corollary 4.3 we need to generalize Euler s approximation of exp x to the case when x is not divided into uniform parts, but potentially very unbalanced parts. This is done in the proposition below Proposition A.5 Let x R, n N, x 0. Let x1, . . . , xℓ R, xi 0 for i = 1, . . . , ℓsuch that Pℓ i=1 xi = x. Then, i=1 (1 + xi) + g(x, x1, . . . , xℓ), where 0 g(x, x1, . . . , xℓ) x exp(x) max i=1,...,ℓxi. In particular, it holds that lim maxi xi 0 i (1 + xi) = exp(x), where convergence is uniform of order O(maxi xi) on any compact subset of [0, ). Proof All statements follow from the first, which we proceed to prove. We use Notation A.3 Writing out the product, we obtain i=1 (1 + xi) = X where abbreviatingly we have written xi := Q i i xi. The Taylor expansion of the exponential on the other hand yields exp(x) = exp Note the major different between both sums above being the repeating indices which may occur in the expansions of exp(x). More precisely, we obtain i=1 (1 + xi) = X Kernels for sequentially ordered data We further split up the sum by length of i: i=1 (1 + xi) = i [ℓ] |i|=m i!>1 Positivity of g follows from this equation and positivity of x. Now consider the map φ which removes the first duplicated index in an ordered index sequence i yielding a sequence of length |i| 1. On sequences of length m, the map φ is at most m-to-one, and surjective onto sequences of length m 1. Therefore, i [ℓ] |i|=m i!>1 i [ℓ] |i|=m 1 where X = maxi xi. Thus, i [ℓ] |i|=m i!>1 i [ℓ] |i|=m Comparing to the expansion of exp(x) above, one observes that the right hand side is equal to m! = X x exp(x). The bounds in Proposition A.5 are worse than those from Euler s classic approximation result to the exponential in the case of equal xi, by a factor of x. This is due to the fact that the bound also needs to be valid for heavily imbalanced partitions of x into xi. Appendix B. Higher order signature kernels and noisy observations The discretized signature kernel was defined as inner product k+ : X+ X+ R k+(x, y) = S+(kx), S+(ky) of features S+ that approximate the signature S, thus k+ will approximate the signature kernel k : PX PX R k (x, y) = S(kx), S(ky) when the sequences are discretizations of paths: in Section 4 we showed under Assumption 3.1 that k+(xπ, yπ ) k (x, y) as max(mesh(π), mesh(π )) 0. Kir aly and Oberhauser For essentially all practically relevant kernels k : X X R Assumption 3.1 holds whever the underlying paths x, y are of bounded variation. However, a common situation is that observations are perturbed by noise which are generically of unbounded variation. Example B.1 Let X = R and k(x, y) = x, y R (that is H = R). If x C1([0, 1], X) and B = (Bt)t [0,1] is a Brownian motion in X then S(kx+B) S(x + B) = d(x + B) m is not well-defined as a Riemann Stieltjes integral. A general strategy is to replace the signature map x 7 S(x) by a a map from paths to Q m 0 H m as follows: construct a sequence of bounded variation paths (xn) that approx- imation x and such that t 7 t 0 dx m n converges to a Q m 0 H m-valued path which is called the (geometric) rough path lift of x. Example B.2 Constructions for mapping a path to an element of Q m 0 H m are known for a wide range of stochastic processes; below we mention a few Brownian motion (leading to p-rough paths for any p > 2 ) more generally, continuous Semimartingale (leading to p-rough paths for any p > 2), fractional Brownian motion of Hurst parameter H > 1 more generally, Gaussian processes (leading to p-rough paths where p depends on the regulary of the covariation process), Markov processes in continuous time (leading to p-rough paths with p depending on the generator of the Markov process). For details and more examples see Lyons (2004). The notion of a geometric rough path allows to study classes of much rougher paths by the same approach we developed for the bounded variation case: below we provide the needed modifications for obtaining a sequentialized kernel such that k+(xπ, yπ ) is still well-defined as the mesh vanishes when x, y are paths of unbounded variation but a rough path lift exists (e.g. all the stochastic processes in Example B.2) Remark B.3 A more radical approach is to assume that the we are not only given paths as data but also the first levels of their rough path lift , see for example Crisan et al. (2013). The same recursion applies to give a kernel, however, we do not spell out details since all benchmark data sets we are aware of, only provide path increments. Definition B.4 Let d N. We call the map S+ (d) :H+ Y m 0 H m, (hi)ℓ i=1 7 the discretized signature map of degree d. We denote with S+ (d,m) the projection of S+ (d) to Lm n=0 H n. Kernels for sequentially ordered data Definition B.5 Let k : X X R and d, m 1, d m. The discretized signature kernel over k of order d and at degree m is defined as k+ (d,m)(x, y) : X+ X+ R, k+ (d,m) (x, y) = S+ (d,m) (kx) , S+ (d,m) (ky) . The sequentialization k+ m from Section 4 arises as special case of above, more general, definition: k+ m = k+ (d,m) for d = 1, m N. Readers familiar with rough paths will notice that the choice m = d = p recovers the notion of (kx(t))t [0,1] lifted to a geometric p-rough path. However, for machine learning applications it is often beneficial to take d = p to ensure convergence but to consider m > d and find the optimal degree m from the data (e.g. by cross-validation). In analogy to the order d = 1 approximations discussed in Section 4, the central mathematical identity is now (x(ti+1) x(ti)) m r=1 (x(tir+1) x(ti)) ˆ dx m where we use Notation A.3. Proposition B.7 For x, y X+, d, m 1, m d, (a) S+ (d,m)(kx) = 1 + Pm n=1 P i d[|x| 1] |i|=n 1 i! Q|i| r=1 i kx, (b) k+ (d,m)(x, y) = 1 + Pm n=1 P i d[|x| 1] j d[|y| 1] |i|=|j|=n 1 i!j! Q|i| r=1 ir,jr k(x, y). where i kx := kxi+1 kxi H. Proof By definition S+ (d)(kx) = 1 n!( r kx) n. Application of the (non-commutative) associative law yields 1 n!( r kx) n = X r=1 ir kx . Using the analogous expression for S+ (d)(y), taking the scalar product while noting r =1 kjr = δ|i|,|j| r=1 kir, jr ky . and truncating at tensor degree m and restricting to i d [|x| 1] and j d [|y| 1] yields the claim. Kir aly and Oberhauser Remark B.8 The appearance of the i! term together with i d [|x| 1] makes a recursion formula more complex since one needs to keep track how many elements in i are equal in every recursion step. We give an effective algorithm in Section B.1 that relies on multi-way recursion. B.1. Computing the higher order discretized signature kernel Algorithm 6 Evaluation of k+ (d,m). Input: Sequences x, y X+. A kernel k : X X R. A truncation degree m. An approximation order d with 1 d m. Output: k+ (d,m)(x, y) 1: Let ℓx |x| 1 and ℓy |y| 1 2: Compute an (ℓx ℓy) array K such that K[i, j] = i,j k(x, y) 3: Initialize an (m d d ℓx ℓy)-array A, all entries zero 4: for n = 2 to m do 5: d min(d, n 1) 6: A[n|1, 1| :, :] K (1 + A[n 1|Σ, Σ| +1, + 1]) 7: for r = 2 to d do 8: A[n|r, 1| :, :] 1 r K A[n 1|r 1, Σ| :, + 1] 9: A[n|1, r| :, :] 1 r K A[n 1|Σ, r 1| +1, :] 10: for s = 2 to d do 11: A[n|r, s| :, :] 1 rs K A[n 1|r 1, s 1| :, :] 12: end for 13: end for 14: end for 15: Compute R 1 + A[m|Σ, Σ|Σ, Σ] 16: Return R Recall that multiplications of arrays in Algorithm 6 are entry-wise (not matrix multiplications). At the end of Algorithm 6, the array A contains as elements A[m|r, s|i, j] the contributions from sub-sequences i [ℓx], j [ℓy], beginning at i and j, with start-sequences iii . . . of length r and jjj . . . of length s, and of total length at most m. Proposition B.9 Let x, y X+, m 1 and d with 1 d m. Define ℓx = |x| 1, ℓy = |y| 1 and An,r,s i,j := X i d[ℓx] j d[ℓy] |i|=|j|=n κ=1 iκ,jκ k(xiκ, yjκ) for every r, s 1 and i [ℓx], j [ℓy] where the sum over i = (i1, . . . , in), j = (j1, . . . , jn) is additionally restricted in the following way: i = i1 = i2 = = ir = ir+1 and j = j1 = j2 = = js = js+1. Kernels for sequentially ordered data By convention, set An,r,s i,j = 0 if r < n or s < n. Then, k+ (d,m)(x, y) = 1 + s=1 An,r,s i,j (5) and the following equalities hold An,r,s i,j = 1 rs i,j k(x, y) An 1,r 1,s 1 i,j , (6) An,1,1 i,j = i,j k(x, y) s=1 An 1,r,s i ,j An,r,1 i,j = 1 r i,j k(x, y) X s=1 An 1,r 1,s i,j , (8) An,1,s i,j = 1 s i,j k(x, y) X r=1 An 1,r,s 1 i ,j , (9) for r, s 2, i [ℓx], j [ℓy]. Proof By Proposition B.7 k+ (d,m)(x, y) = 1 + i d[ℓx] j d[ℓy] |i|=|j|=n r=1 ir,jr k(x, y) i d[ℓx] j d[ℓy] 1 i!j! i1,j1 k(x, y) (1 + i2,j2 k(x, y) (1 + (1 + im,jm k(x, y))) ) and the first equality shows the identity (5) by explicitly summing over that starting point of sequences and the number of repeated elements. Equality (6) follows directly by definition of An,r,s i,j and that for i = (i1, i2, . . . , in) with d(i) = r and i = (i2, . . . , in) we have i! = 1 r(i !). Equality (7) follows since {i = (i1, . . . , in) : d(i) = 0, i1 = i} = {(i, i ) : i = (i 1, . . . , i n 1), i 1 > i} r=0 {(i, i ) : i = (i 1, . . . , i n 1), i 1 > i, d(i ) = r} where in above equalities we implicitly additionally assume that all indices i, i are increasing (that is, i1 in, i 1 i n 1). Equalities (8),(9) follow similarly. Remark B.10 The computational cost of Algorithm 6 is O(d2m|x||y|) elementary arithmetic operations (= the number of loop elements) and O(d2|x||y|) units of elementary storage (when freeing up space for array entries directly after the last time they are read out). Remark B.11 The higher order Algorithm 6 can be combined with the low-rank techniques analogous to the treatment of the order d = 1 algorithm in Section 6 by applying the lowrank representation to the matrices/2D-arrays A[m|i, j| :, :]. Since all assignments and manipulations can be re-phrased in those matrices, the same strategy applies. Claus Bahlmann, Bernard Haasdonk, and Hans Burkhardt. Online handwriting recognition with support vector machines-a kernel approach. In Frontiers in handwriting recognition, 2002. proceedings. eighth international workshop on, pages 49 54. IEEE, 2002. Dan Crisan, Joscha Diehl, Peter Friz, and Harald Oberhauser. Robust filtering: multdimensional noise and multidimensional observation. Annals of Applied Probability, 23(5): 2139 2160, 2013. Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1 49, 2002. Marco Cuturi. Fast global alignment kernels. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 929 936, 2011. Marco Cuturi, J-P Vert, Oystein Birkenes, and Takashi Matsui. A kernel for time series based on global alignments. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 2, pages II 413. IEEE, 2007. Joscha Diehl. Rotation invariants of two dimensional curves based on iterated integrals. Co RR, abs/1305.6883, 2013. URL http://arxiv.org/abs/1305.6883. Toni Giorgino. Computing and visualizing dynamic time warping alignments in r: the dtw package. Journal of statistical Software, 31(7):1 24, 2009. Ben Hambly and Terry J Lyons. Uniqueness for the signature of a path of bounded variation and the reduced path group. Annals of Mathematics, 171(1):109 167, 2010. David Haussler. Convolution kernels on discrete structures. Technical report, Citeseer, 1999. Joseph B Kruskal. An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM review, 25(2):201 237, 1983. Christina Leslie and Rui Kuang. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 2004. Daniel Levin, Terry J Lyons, Hao Ni, et al. Learning from the past, predicting the statistics for the future, learning an evolving system. ar Xiv preprint ar Xiv:1309.0260, 2013. Huma Lohdi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2002. Terry J Lyons. Differential equations driven by rough paths. Springer Berlin Heidelberg New York, 2004. Terry J Lyons, Hao Ni, and Harald Oberhauser. A feature set for streams and an application to high-frequency financial tick data. In Proceedings of the 2014 International Conference on Big Data Science and Computing, page 5. ACM, 2014. Hiroshi Shimodaira Ken-ichi Noma. Dynamic time-alignment kernel in support vector machine. Advances in neural information processing systems, 14:921, 2002. Anastasia Papavasiliou, Christophe Ladroue, et al. Parameter estimation for rough differential equations. The Annals of Statistics, 39(4):2047 2073, 2011. Hiroaki Sakoe. Two-level dp-matching a dynamic programming-based pattern matching algorithm for connected word recognition. Acoustics, Speech and Signal Processing, IEEE Transactions on, 27(6):588 595, 1979. Hiroaki Sakoe and Seibi Chiba. A similarity evaluation of speech patterns by dynamic programming. In Nat. Meeting of Institute of Electronic Communications Engineers of Japan, page 136, 1970. Christos Sapsanis, George Georgoulas, and Anthony Tzes. EMG based classification of basic hand movements based on time-frequency features. In Control & Automation (MED), 2013 21st Mediterranean Conference on, pages 716 722. IEEE, 2013. Bernhard Sch olkopf and Alexander J Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT press, 2002. John Shawe-Taylor and Nello Cristianini. Kernel methods for pattern analysis. Cambridge university press, 2004. Kilho Shin and Tetsuji Kuboyama. A generalization of Haussler s convolution kernel: mapping kernel. In Proceedings of the 25th international conference on Machine learning, pages 944 951. ACM, 2008. Weixin Yang, Lianwen Jin, and Manfei Liu. Character-level chinese writer identification using path signature feature, dropstroke and deep CNN. abs/1505.04922, 2015. URL http://arxiv.org/abs/1505.04922.