# kernels_for_sequentially_ordered_data__3bcf8849.pdf

Journal of Machine Learning Research 20 (2019) 1-45 Submitted 6/16; Revised 2/18; Published 2/19

Kernels for Sequentially Ordered Data

Franz J. Kir aly f.kiraly@ucl.ac.uk Department of Statistical Science University College London London WC1E 6BT, United Kingdom

Harald Oberhauser oberhauser@maths.ox.ac.uk Mathematical Institute University of Oxford Oxford OX2 6GG, United Kingdom

Editor: Le Song

We present a novel framework for learning with sequential data of any kind, such as multivariate time series, strings, or sequences of graphs. The main result is a sequentialization that transforms any kernel on a given domain into a kernel for sequences in that domain. This procedure preserves properties such as positive deﬁniteness, the associated kernel feature map is an ordered variant of sample (cross-)moments, and this sequentialized kernel is consistent in the sense that it converges to a kernel for paths if sequences converge to paths (by discretization). Further, classical kernels for sequences arise as special cases of this method. We use dynamic programming and low-rank techniques for tensors to provide eﬃcient algorithms to compute this sequentialized kernel. Keywords: Sequential data, kernels, signature, ordered moments, signature kernels

1. Introduction

Sequentially ordered data are ubiquitous in modern science, occurring as time series, location series, or, more generally, sequentially observed samples of structured objects. Two stylised facts make learning with sequential data an ongoing challenge:

(A) Sequential data is very diverse, e.g. it includes sequences of diﬀerent length of scalars, letters, images, graphs (in time series, text, video, network evolution) or even heterogeneous combinations of these. This diversity is usually addressed by ad-hoc approaches, extensive pre-processing and manual extraction of features, thus speciﬁc to the data at hand.

(B) Sequential data is often large, with sequences easily obtaining the length of hundreds, thousands, millions. Hence, the data sets quickly become very huge, and with them computational cost.

In this paper, we present a novel approach to learning with sequential data based on joining ideas from stochastic analysis and kernel learning, addressing the points above:

c 2019 Franz J. Kir aly and Harald Oberhauser.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/16-314.html.

Kir aly and Oberhauser

(a) The (discretized) signature is a canonical feature map for sequences. It can intuitively be described as an ordered version of sample moments. It completely describes a sequence and makes sequences of diﬀerent size and length comparable.

(b) The kernel trick applied to signature features avoids the combinatorial explosion of coordinates inherent to the signature feature map (in analogy to the classic polynomial kernel in Rd).

The main results of this article can be described as follows: if X is a topological space ( structured objects ) and we denote with X+ = S ℓ 1 Xℓthe set of arbitrary length sequences in X ( sequences of structured objects ) then we construct a map that takes

a kernel k : X X R to a kernel k+ : X+ X+ R.

We call k+ the discretized signature kernel (or sequentialization) over k and provide eﬃcient algorithms to evaluate k+. This transformation is canonical in the sense that it preserves important properties of k, such as positive deﬁniteness and it is consistent in the sense that k(x, y) for x, y X+ converges to an inner product of a feature map for paths that is classic in stochastic analysis (the path signature) if the sequences x, y converge to paths (from discrete to continuous time). Classical kernels for sequences such as string, global alignment or relation-convolution kernels can be identiﬁed as special cases of the discretized signature kernel k+; for example string kernels arise by taking X to be a ﬁnite set of letters . More interestingly, even for X = R this yields new kernels for time series by the sequentialization of classic non-linear kernels such as Gaussian, Laplace, etc. From a methodological point of view, this gives a canonical way to transform static features (resp. kernels k) for structured objects into features (resp. kernels k+) for evolving structured objects.

1.1. Kernels for paths

We ﬁrst discuss the situation for path-valued observations (continuous time) rather than sequence-valued observations (discrete time) the latter then follows by discretizing time. That is, if we denote with PX X[0,1] a suﬃciently regular subset of the space of paths that evolve in X (without loss of generality the paths are parameterized by [0, 1]), our aim is to transform a kernel k : X X R into a kernel k : PX PX R.

Therefore, ﬁrst recall that X u 7 ku := k(u, ) H RX can be thought of as the (canonical) feature map induced by k. It is then natural to proceed as follows:

1. Lift a path evolving in data space X to a path evolving in feature space H. We lift the path x that evolves in data space X to a path1 kx that evolves in the feature space H via the kernel k. That is, we map

x = (x(t))t [0,1] PX to kx = kx(t)

where PH denotes a suﬃciently regular subset of H[0,1].

1. Note the slight abuse of notation that ku H if u X and kx PH if x PX.

Kernels for sequentially ordered data

2. Signature features for paths in H. We use a feature map that is well-known in stochastic analysis for paths that evolve in linear spaces such as H. This map is called the signature map S. It maps a path h PH into a series of tensors,

h 7 S (h) Y

By applying this map to the lifted path kx PH, we get the feature map PX Q m 0 H m, deﬁned as x 7 S(kx).

3. Signature kernels. Taking inner products yields the kernel

k : PX PX R deﬁned as inner product k (x, y) = S(kx), S(ky) .

It turns out, that k has a recursive structure that allows for an eﬃcient and robust calculation even though kx, ky are paths evolving in the (in general inﬁnitedimensional) state space H.

Below we give more details about the last two points of this construction.

1.2. The signature map

The signature S maps a (suﬃciently regular) path h PH H[0,1] into a series of tensors,

h 7 S(h) = (Sm(h))m 0 Y

where by convention H 0 := R and S0(h) := 1. The remaining terms Sm(h) are deﬁned as follows: the ﬁrst degree, S1(h), of the signature is simply the average change,

S1(h) := Et[ h(t)] H,

where h denotes the derivative of h and the expectation is taken over t sampled uniformly from [0, 1]. The second degree of the signature, S2(h), is the (non-centred) covariance of changes at two subsequent time points, that is

2Et1<t2[ h(t1) h(t2)] H 2,

where the expectation is over the time points t1, t2 sampled from the uniform distribution on [0, 1] and put in chronological order. In general, the m-th degree of the signature, Sm(h), is deﬁned as the m-th moment tensor of the inﬁnitesimal changes, where the expectation is taken over m time points, t1, . . . , tm, that are sampled from the order statistic2 of the uniform distribution on [0, 1], that is

m!Et1< <tm[ h(t1) h(tm)] H m.

2. The order statistic for the uniform distribution [0, 1] of m points has a density p (t1, . . . , tm) = m!1t1< tm<1

Kir aly and Oberhauser

Hence, the signature S(h) of a path h characterises the sequential structure in h by quantifying dependencies in their change, similar to classic sample moments in Rd. In fact, signatures are in close mathematical analogy to polynomials: the feature map x 7 S(h) is injective and if h 7 f(h) is a non-linear function on pathspace, then it can be well approximated as a linear function of the signature features S(h), that is f(h) ℓ, S(h) . However, signature features have the following drawbacks

Even if H = Rd is ﬁnite dimensional, we have Sm(h) Rd m, thus the number of (scalar) coordinates needed to store Sm(h) grows as O(dm) as m increases. This combinatorial explosion makes the use of signature features costly for paths that evolve in high-dimensional state spaces H and infeasible for paths evolving in inﬁnite dimensional state spaces H.

Signatures are only deﬁned for paths evolving in linear spaces; S(h) does not make sense when h evolves in a general topological space X since h(t)dt = dh(t) is not well-deﬁned.

However, as we will show, both of these shortcomings can be addressed by kernelization.

1.3. Inner products of signature features

The space Q m 0 H m is a linear space and by restricting to a subspace, we can work with an inner product space (in fact, again a Hilbert space after completion); that is we deﬁne for (sm), (tm) Q m 0 H m

(s0, s1, . . .) + (t0, t1, . . .) := (s0 + t0, s1 + t1, . . .) ,

(s0, s1, . . .) , (t0, t1, . . .) := X

m 0 sm, tm H m

where , H m is given as u1 um, v1 vm H m := Qm i=1 ui, vi H. The inner product S(g), S(h) of the signature features of (suﬃciently regular) paths g, h PH is ﬁnite and using conditional expectations we get that S(g), S(h) equals

1 m!2 Es1< <sm t1< <tm [ g(s1) g(sm), h(t1) h(tm) H m]

1 m!2 Es1< <sm t1< <tm [ g(s1), h(t1) H g(sm), h(tm) H]

g(s1), h(t1)

22 Es2|s1 t2|t1

g(s2), h(t2)

32 Es3|s2 t3|t2 [. . . ]

Now applied to g = kx PH, h = ky PH, using the reproducing property of k we get

kx(s), ky(t) H ds dt = dg(s), dh(s) H = ds dt k(x(s), y(t)) = κ(ds, dt)

where κ denotes the measure on [0, 1]2 deﬁned as

κ([s, t] [u, v]) = k(x(t), y(v)) k(x(s), y(v)) k(x(t), y(u)) + k(x(s), y(u)).

Kernels for sequentially ordered data

(Note that this measure does not use diﬀerentiability of x, y). Replacing above expectations over the order statistic by explicit integration over [0, 1], we arrive at

k (x, y) = 1 + ˆ

κ(ds1, dt1)

s2>s1 t2>t1

κ(ds2, dt2)

s3>s2 t3>t2 [. . . ]

Finally, the data we have access is to is only a sequence (x(t1), . . . , x(tn)) X+ rather than the whole path x = (x(t))t [0,1] PX, but replacing the integrals by sums immediately gives an explicit recursive formula for a kernel k+ : X+ X+ R that is determined by a Horner-type3 recursive evaluation; especially, one never computes the signature map of a path that evolves in the high/inﬁnite-dimensional Hilber space H directly. This yields a robust and computationally eﬀective formula for a kernel k+ for sequences, that has well-deﬁned behaviour in the scaling limit when sequences approximate paths. We make the above informal derivation rigorous in the following sections, study its properties and extended it to non-smooth paths in Appendix B.

1.4. Prior art and literature review

Learning with sequential data is a vast ﬁeld. Prior art that inspired our present approach can be found in

(i) dynamic programming algorithms for sequence comparison in the engineering community,

(ii) kernel and Gaussian processes for sequences in the machine learning community,

(iii) Rough paths in the stochastic process community.

Beyond the above, we are not aware of literature in statistics that deals with sequencevalued data points in a way other than ﬁrst identifying one-dimensional sequences with real vectors of same size, or even forgetting the sequence structure entirely and replacing the sequences with (order-agnostic) aggregates such as cumulants, quantiles or principal component scores. Below we discuss above three points in more detail:

(i) Dynamic programming algorithms for sequence comparison.

The earliest use of order information in sequences for learning can probably be found in Sakoe and Chiba (1970) and Sakoe (1979) by using editing or distortion distances. These distances are then employed for classiﬁcation by maximum similarity/minimum distance principles. Through theoretical appeal and eﬃcient computability, sequence comparison methods, later synonymously called dynamic time warping methods, have become one of the standard methods in comparing sequential data Kruskal (1983); Giorgino (2009). Sequence comparison methods in their original formulation can only be directly applied to

3. Horner s scheme to evaluate a uni-variate polynomial p(x) = Pm i=0 aixi is to write p(x) = a0 + x (a1 + x (a2 + + x (am 1 + xam))) and compute the brackets from the inside to the outside. This needs only n additions and n multiplications which is optimal and in contrast to the n additions and n2+n

2 multiplications needed for the naive evaluation of p(x); further, it is numerically more stable since one never adds ﬂoats that live on diﬀerent scales (such as a1x and amxm). This structural similarity with formula (1) will be the key to our algorithms.

Kir aly and Oberhauser

relatively simple distance-based learning algorithms. This has been addressed by combining such methods with kernel learning (discussed below).

(ii) Kernels and Gaussian processes for sequences.

Kernel learning is one of the most popular approaches to make non-linear data of arbitrary kind amenable to classic and scalable linear algorithms and it provides learning theoretical guarantees, see Sch olkopf and Smola (2002); Shawe-Taylor and Cristianini (2004); Cucker and Smale (2002). Kernels for strings, that is, sequences of symbols, were among the ﬁrst to be considered by Haussler (1999) and fast algorithms were obtained a few years later Lohdi et al. (2002); Leslie and Kuang (2004). Kernels based on the above-mentioned dynamic programming (time warping) approach were developed, for sequences of arbitrary objects leading to so-called alignment kernel, see Bahlmann et al. (2002); Noma (2002); Cuturi et al. (2007); Cuturi (2011). All of the mentioned kernels can be viewed from Haussler s original, visionary relation-convolution kernel framework, Haussler (1999), the existing literature provides no unifying approach to kernels for sequences: for example the relation between string kernels and dynamic time warping/global alignment kernels, or to the classical theory of time series has remained unclear. Further, the only known kernels for sequences of arbitrary objects, the dynamic time warping/global alignment kernels, are in general not positive deﬁnite.

(iii) Rough paths and stochastic processes.

Series of iterated integrals appear in diverse areas such as control theory, combinatorics, homotopy theory, physics (Feynman Dyson Schwinger theory) and more recently probability theory; see (Lyons, 2004, Section Historic papers , p97). This series is treated under various names in the literature like Magnus expansion , time ordered or non-commutative exponential , or the one we follow, namely the signature of a path . Signatures have found many applications in stochastic analysis (rough path theory, regularity structures), but their use in statistics and machine learning is very recent: Papavasiliou et al. (2011) applies it to SDE parameter estimation; Levin et al. (2013); Lyons et al. (2014) applies it to forecasting and classiﬁcation; Yang et al. (2015) uses signature features as input to deep neural nets. So far, all applications are restricted to paths evolving in low-dimensional linear spaces due to the computational bottleneck given by the combinatorial explosion of signature coordinates.

1.5. Outline

Section 2 introduces the signature as a canonical feature map for path-valued data. Section 3 shows that this signature can be kernelized to get a kernel k for paths and studies its properties. Section 4 studies the discretization to turn k into a kernel k+ for sequences. Section 5 shows that classic kernels for sequences arise as special cases of our construction. Section 6 presents eﬀective algorithms for computing k+ using above recursion with low-rank techniques. Section 7 presents a Numpy implementation and basic benchmarks.

1.6. Notation

Kernels for sequentially ordered data

Spaces, sequences and paths

X an arbitrary set X+ X+ := S ℓ 1 Xℓsequences of arbitrary length in X PX a set of continuous paths from [0, 1] to X H the reproducing kernel Hilbert space, H RX, of the kernel k H+ H+ := S ℓ 1 Hℓsequences of arbitrary length in H+

PH a set of continuous bounded variation paths from [0, 1] to H

Signature features for paths and sequences in H Q m 0 H m the linear space of series of tensors in H S the signature map S : PH Q m 0 H m

S+ the discretized signature map S+ : H+ Q m 0 H m

, inner product on the subset of square-summable elements of Q m 0 H m

norm given by , for square summable elements of Q m 0 H m

k a kernel k : X X R kx with x X the element kx := k(x, ) H RX

kx with x X+ the sequence kx := (kxi)i=1,...,ℓ H+ with x = (xi)i=1,...,l kx with x PX the path kx := (kx(t))t [0,1] PH with x = (x(t))t [0,1] k+ a kernel k+ : X+ X+ R, k+(x, y) := S+(kx), S+(ky) k a kernel k : PX PX R, k (x, y) := S(kx), S(ky)

Partitions of [0, 1]

π a partition of [0, 1], i.e. a collection of points 0 = t1 < < tl = 1 mesh(π) mesh(π) := maxi=1,...,l 1 |ti+1 ti| xπ for x PX xπ := (x(ti))i=1,...,l X+, the path x sampled along π

1.7. Author s contribution

Both authors contributed equally, HO is the corresponding author.

2. The signature feature map

Mapping paths to series of tensors is well-known technique in stochastic analysis. In this section we give a short introduction and highlight aspects of the signature map that make it a good feature map for paths.

2.1. Bounded variation paths

Deﬁnition 2.1 Let H a Hilbert space. We denote the set of H-valued paths of bounded variation on [0, 1] starting at the origin by

C1([0, 1], H) := {h C ([0, 1], H) : h(0) = 0, h 1 < }

Kir aly and Oberhauser

where h 1 = supπ Pl 1 i=1 h(ti+1) h(ti) H and the supremum is taken over all ﬁnite partitions π = {(ti)i=1,...,l : 0 = t1 < < tl = 1} of [0, 1].

As in the ﬁnite-dimensional case, the integral 1 0 y(t) dh(t) H is deﬁned as the limit of Riemann Stieltjes sums, see Appendix A for details. The integral itself is in general not a scalar, but an element of the Hilbert space, y dh H.

Deﬁnition 2.2 We deﬁne the sequence dh m

m 0 Q m 0 H m as4

ˆ dh 0 := 1 and ˆ dh m := ˆ 1

0 dh (m 1) dh(t).

S : C1 ([0, 1] , H) Y

m 0 H m , h 7 ˆ dh m

and call S(h) the signature of the path h and denote Sm(h) = dh m.

Example 2.3 Let H = R2 and h(t) = (a(t), b(t)) . The ﬁrst levels of the signature of h are

S0(h) = 1, S1(h) =

1 0 da(t) 1 0 db(t)

1 0 t2 0 da(t1) da(t2) 1 0 t2 0 da(t1) db(t2) 1 0 t2 0 db(t1) da(t2) 1 0 t2 0 db(t1) db(t2)

The degree m = 3, S3(h) R2 3, has 23-scalar valued coordinates, etc. In general, for a path in Rd we need dm scalars to describe dh m and it is exactly this combinatorial explosion that makes the use of signatures prohibitively expensive already for moderately high-dimensional H and impossible to use for inﬁnite-dimensional H (such as the RKHS associated with most kernels relevant in machine learning).

Example 2.4 For the special case of a linear path in H = Rd, i.e. h(t) = t v for a ﬁxed vector v Rd, a direct calculation shows that

Sm(h) = v m

m! = h(1) m

or in more concise notation5, S(h) = exp(v) = exp(h(1) h(0)). This explains why the signature S(h) is also known as time-ordered exponential. This analogy with the exponential function will be key motivation for the estimates in Section 4.

4. Recall that by convention H 0 := R. 5. where one denotes t = (tm)m Q

m 0 H m as t = t0 + t1 + t2 + , set exp(t) = P

m! and identify v Rd as t = 0 + v + 0 + 0 + .

Kernels for sequentially ordered data

2.2. The signature as a canonical feature map

The signature S(h) describes a path h by an element in the linear space Q m 0 H m and the following properties make it a good feature map

1. The map h 7 S(h) is continuous,

2. The map h 7 S(h) is injective up to tree-like equivalence6,

3. For any f C(K, R), K C1([0, 1], H) compact, and ϵ > 0 there exists a ℓ L m 0 H m such that

sup h K |f (h) ℓ, S (h) | < ϵ.

The proof is classic in stochastic analysis and given in Appendix A; the key insight is that the space of linear combinations of iterated integrals forms an algebra. To sum up

1. the signature features are a mathematically faithful representation of the underlying path-valued data h.

2. linear combinations of signature coordinates approximate continuous functions of paths arbitrarily well. In the kernel learning context this is known as universality .

3. the signature features are sequentially ordered analogues to moments. Similar to polynomials, the feature space has the natural grading by m designating the polynomial degree .

3. Signature kernels

We come back to our original motivation: we are given a RKHS (H, k) on X and we want to transform this into a RKHS for paths in X. Therefore recall that the reproducing property implies that

k : X X R equals k(u, v) = ku, kv H

and that one may think of X u 7 ku( ) := k(u, ) H as the (canonical) feature map induced by the kernel k. Now given a path x PX we lift it to a path7 kx PH given as

6. We call h, h tree-like equivalent if the concatenation h h 1 is tree-like. A path z C1([a, b], H) is called tree-like if there exists a continuous map f : [a, b] [0, ) with f (0) = f (T) = 0 and |z (t) z (s)| f (s) + f (t) 2 infu [s,t] f (u) for all s t [a, b]. The key remark is that being tree-like equivalent is a very strong assumption, typically not seen in real data. Even if presented with a tree-like path, adding time as extra coordinate transforms the path to a non-treelike path (that is, working with t 7 (t, h(t)) instead of t 7 h(t)). 7. Recall the slight clash of notation ku H for u X and kx PH if x PX, but the meaning will always be clear from the context. In fact, later we are also going to use kx H+ if x X+.

Kir aly and Oberhauser

t 7 kx(t) := kx(t)( ) := k(x(t), ). The results of Section 2 suggest S(kx) as features for kx, that is the feature map

PX x 7 S(kx) Y

m 0 H m. (2)

Theorem 1 guarantees that S(kx) captures all the information about x PX that can be obtained by looking at the path x through the kernel k. Most RKHS H in machine learning are inﬁnite dimensional which makes this approach infeasible; even for ﬁnite dimensional H we suﬀer from the combinatorial explosion of the number of signature coordinates, Example 2.3. The main result of this section is that the feature map (2) can be complety kernelized, that is by deﬁning

(x, y) 7 k (x, y) = S(kx), S(ky)

we get a kernel for paths in X that can be very eﬃciently evaluated using only k (x(s), y(t)) for s, t [0, 1]. That is, only evaluations of the static kernel k : X X R. We henceforth restrict attention to positive deﬁnite kernels on X and a set of paths PX that lift to bounded variation paths. All our arguments extend to unbounded variation such as semi-martingales, diﬀusion processes, Markov processes or Gaussian processes and we give the needed modiﬁcations in Appendix B.

Assumption 3.1 Throughout the remainder of this article we assume that k : X X R is continuous and positive deﬁnite. We denote with H the associated RKHS. Further, we denote with PX X[0,1] a set of paths in a topological space X and (H, k) a RKHS on X such that

{t 7 kx(t) : x PX} C1([0, 1], H).

This yields the following kernel,

Deﬁnition 3.2 We call

k :PX PX R , (x, y) 7 S(kx), S(ky) ,

the signature kernel over k.

Theorem 2 The signature kernel over k

k : PX PX R, k (x, y) = S (kx) , S (kx)

1. is a positive deﬁnite kernel,

2. k (x, y) = P m 0 s1< <sm t1< <tm

Qm i=1 dκ(si, ti) with κ deﬁned as

κ ([s, t] [u, v]) := k (x(t), y(v)) k (x(s), y(v)) k (x(t), y(u)) + k (x(s), y(u))

and κ is a signed Borel measure on [0, 1]2 for every x, y PX.

Kernels for sequentially ordered data

Proof Point (1) follows since k is given as an inner product, see Sch olkopf and Smola (2002). Using the deﬁnition of k we calculate

k (x, y) = S(kx), S(ky) = X

m 0 ˆ dk m x , ˆ dk m y H m

0 dk (m 1) x

dkx(sm), ˆ 1

0 dk (m 1) y

sm [0,1] tm [0,1] ˆ sm

0 dk (m) x , ˆ tm 1

0 dk (m 1) y dkx(sm), dky(tm)

sm [0,1] tm [0,1]

sm 1 [0,sm] tm 1 [0,tm] ˆ

s1 [0,s2] t1 [0,t2] dkx(s1), dky(t1) dkx(sm), dky(tm) .

Now consider two paths x, y that are piecewise linear, that is there are time points s1 sk such that x is constant on intervals (si, si+1) and analogous for y there exists t1 tl such that y is constant on all intervals (ti, ti+1). Then

(s1,sk) (t1,tl) dkx(s), dky(t) H

1 i k 1 1 j l 1

(si,si+1) (tj,tj+1) dkx(s), dky(t) H

1 i k 1 1 j l 1

kx(si+1) kx(si), ky(tj+1) ky(tj) H

1 i k 1 1 j l 1

k(x(si+1), y(tj+1)) k(x(si), y(tj+1)) k(x(si+1), y(tj)) + k(x(si), y(tj))

where we used the reproducing property kx(s), ky(t) = k(x(s), y(t)) for the last equality. Note that

κ ([r, s] [u, v]) := k(x(si+1), y(tj+1)) k(x(si), y(tj+1)) k(x(si+1), y(tj)) + k(x(si), y(tj))

deﬁnes a signed measure on [0, 1]2, hence above reads

(s1,sk) (t1,tl) dkx(s), dky(t) H = X

1 i k 1 j l

κ([s1, sk] [t1, tl]) = ˆ

(s1,sk) (t1,tl) dκx,y(s, t).

Combined with the identity (3) this shows (2) for piecewise constant paths. For general x, y ﬁx a sequence {(sn i ) : 0 = s1 < < sn l+1 = 1} [0, 1]n such that maxi |sn i+1 sn i | 0

Kir aly and Oberhauser

as n and denote with xn, resp. yn, the piecewise linear path given by interpolating between points (x(sn i ))i, resp. (y(sn j ))j. Then

κxn,yn([r, s] [u, v]) κx,y([r, s] [u, v])

for all r < s, u < v directly by the deﬁnition of κ. Thus κxn,yn converges weakly to κx,y On the other hand, S (xn) , S (yn) converges to S (x) , S (y) which ﬁnishes the proof by sending n

The equality in Point (2) in Theorem 2 suggests the expansion

k (x, y) = 1 + ˆ

(s1,t1) (0,1) (0,1)

(s2,t2) (0,s1) (0,t1) (1 + ) dκx,y(s2, t2)

dκx,y(s1, t1).

This recursive structure will be the key for an eﬃcient computation.

4. Discretization and recursion: from paths to sequences

To make the kernel k : PX PX R useful for real-world applications, we need to address the following two points:

1. (Sequences). We do not observe the full path (x(t))t [0,1] PX but only a sequence

xπ = (x(ti))l i=1 X+ for a partition π = {0 = t1 < < tℓ= 1}. The partition π might even vary from path to path.

2. (Computational cost). Computing the kernel k : PX PX R by evaluating it as an inner product is computationally prohibitive even with X = Rd = H and the linear kernel k( , ) = , Rd is used if d is large (Example 2.3). It is infeasible for standard kernels such as the Gaussian kernel that have an inﬁnite-dimensional RKHS.

To address point (1), we approximate the signature map

x 7 S(kx) = ˆ dk m x

by discretization of the integrals. This gives a feature map S+ for sequences

xπ 7 S+(kxπ) =

1 i1< <im ℓ 1 i1 kx im kx

where we use the notation i kx := kx(ti+1) kx(ti) and kxπ = (kx(ti))l i=1 H+. A simple calculation shows that this is equivalent to

xπ 7 S+(kxπ) =

i=1 (1 + i kxπ i )

Kernels for sequentially ordered data

if Q denotes the linear extension8 of the tensor product to the linear space Q m 0 H m and this is the deﬁnition we are going to use. The kernel k+ is deﬁned as the inner product of this feature map k+ : X+ X+ R, (x, y) 7 S+(kx), S+(ky)

To address Point (2), we make heavy use of the recursive expansion (4) for k and truncate after a given degree m (that is usually determined by cross-validation to avoid overﬁtting in analogy to the classic polynomial kernel).

4.1. Discretized signatures

Deﬁnition 4.1 We call the map

m 0 H m, (hi)l i=1 7

i=1 (1 + ih)

the discretized signature map (of order 1).

To prove quantitative results about the quality of this discretization, we need to understand how fast S+(kxπ) approximates S(kx) as mesh(π) := maxi |ti+1 ti| 0. We ﬁrst study this approximation of paths by sequences for general paths h PH.

Deﬁnition 4.2 Deﬁne for d 1

expd : Rd R, v = (vi)d i=1 7

i=1 (1 + vi)

Deﬁne for a partition π = (ti)ℓ i=1 with 0 = t1 < < tℓ= 1

h (π) : C1([0, 1], H) R, h 7 h[t1,t2] 1, . . . , h[tℓ 1,tℓ] 1

where h[a,b] denotes the restriction of h C1 ([0, 1], H) to C1([a, b], H).

Using this notation, the main result of this section reads

Theorem 3 For h C1([0, 1], H) and a partition π = (ti)ℓ i=1, 0 = t1 < < tℓ= 1,

S+(hπ) S(h) exp ( h 1) exp1 h (π) ,

where hπ = (h(ti))i=1,...,ℓ H+.

As a consequence, we get explicit convergence rates.

8. For s, t Q

m 0 H m with s0 = t0 = 1 deﬁne s t Q

m 0 H m as having the m-element equal to Pm i=0 sitm i. If we denote s = (s0, s1, . . .) as s = s0 + s1 + , then this reads s t = 1 + s1 + t1 + s2 + s1t1 + t2 + and Q(1 + ih) refers to this product. Informally speaking, we multiply series of tensors in the same way we multiply polynomials.

Kir aly and Oberhauser

Corollary 4.3 With the same notation as in Theorem 3 above, it holds that S+(hπ) S(h) h 1e h 1 max i=1,...,ℓ 1 h[ti,ti+1] 1.

Hence, the convergence lim h (π) h 1 S+(hπ) = S(h), is of order O maxi h[ti,ti+1] 1

Corollary 4.4 With the same notation as in Theorem 3 above, it holds that if π is chosen such that h[ti,ti+1] 1 = h 1

ℓ 1 for all i = 1, . . . , ℓ 1, then one has the asymptotically tight bound

S+(hπ) S(h) e h 1

1 + h ℓ 1 1 (ℓ 3)!

The proofs of Theorem 3 and Corollaries 4.3 and 4.4 are given in Appendix A.

Remark 4.5 (About geometric approximations) The reader familiar with path signatures will note that the approximations S+(hπ) to S(h) are not group-like elements of the tensor (Hopf) algebra Q m 0 H m, thus non-geometric in the sense that there can not exist a path ˆh such that S+(hπ) = S(ˆh). However, as we will see in the following sections, from a computational perspective non-geometric approximations have many beneﬁts. In fact, the non-geometric approximations of this section recover classic machine learning constructions, see Section 5. Nevertheless, one can modify the discussion of this section to produce geometric approximations and we carry this out in Appendix B; see also Remark B.6.

4.2. Discretized signature kernels

Deﬁnition 4.6 Given x = (xi)l i=1 X+ denote kx = (kxi)l i=1 H+. We refer to

k+ : X+ X+ R, k+(x, y) = S+(kx), S+(ky)

as the discretized signature kernel (or sequentialization) over k of order 1.

Theorem 4 The discretized signature kernel over k,

k+ : X+ X+ R, k+(x, y) = S+ (kx) , S+ (kx)

1. is a positive deﬁnite kernel,

2. k+(x, y) = P m 0 P 1 i1< <im<|x| 1 j1< <jm<|y|

Qm r=1 ir,jr k(x, y),

3. k+(x, y) = 1 + P i1 1 j1 1 i1,j1 k(x, y) 1 + P i2>i1 j2>j1 i2,j2 k(x, y) 1 + P i3>j2 j3>j2 . . . ,

where we use the notation x = (xi)|x| i=1 , y = (yi)|y| i=1 X+ and

i,j k(x, y) := k(xi+1, yj+1) + k(xi, yj) k(xi, yj+1) k(xi+1, yj).

Kernels for sequentially ordered data

Proof The ﬁrst point follows, since k+ is an inner product. For the second point note that i kx kxi+1 kxi, i ky kyj+1 kyj and that by the reproducing property i kx, j ky = i,j k(x, y). The identity then follows from multiplying out

S+(kx), S+(ky) = Y

i (1 + i kx), Y

j (1 + j ky)

and using the deﬁnition of the inner product on H m. The last point follows by rearranging the summation.

We can now combine this with Theorem 3 to quantify how fast the kernel k+ for sequences approximates the kernel k for paths as the mesh of partitions gets ﬁner.

Corollary 4.7 Let x, y PX, and π = (si)k i=1 with 0 = s1 < < sk = 1 and ρ = (tj)l j=1 with 0 = t1 < < tl = 1. Then,

k+(xπ, yρ) k (x, y) 4e x 1+ y 1 2e x (π)+ y 1 2e x 1+ y (ρ).

In particular,

lim x (π) x 1 y (ρ) y 1

k+(xπ, yρ) = k (x, y)

where convergence is uniform of order O maxi x[si,si+1] 1 + maxj y[tj,tj+1 1

Proof It holds that

k+(xπ, yρ) k (x, y) = S+(kπ x), S+(kρ y) S(kx), S(ky)

= S+(kπ x), S+(kρ y)) S(ky) + S+(kπ x) S(kx), S(ky) .

The Cauchy Schwarz inequality implies that

| S+(kπ x), S+(kρ y)) S(ky) | S+(kπ x) S+(kρ y) S(ky) .

Theorem 3 implies

S+(kπ x) S+(kρ y)) S(ky) 2 exp( x 1) exp( y 1 exp( y ρ)

and similarly, one obtains

S+(kρ y), S+(kπ x) S(kx) 2 exp( y 1) (exp( x 1) exp( x π) .

Putting all (in-)equalities together yields the main claim, the convergence statement follows from Theorem 3 (ii).

Kir aly and Oberhauser

4.3. Truncation and variations on a theme

The ﬁnal step to get a practically useful kernel for sequences is to truncate the discrete signature features S+(kx) Q m 0 H m at a given degree m 1. In analogy with the classic polynomial features (resp. kernel), this not only serves a computational purpose but is also necessary to avoid overﬁtting; in applications the truncation degree m will be typically determined by cross-validation.

Deﬁnition 4.8 Given an integer m 1, we call

k+ m : X+ X+ R, k+ m(x, y) = S+(kx), S+(ky) m

the discretized signature kernel (or sequentialization) over k truncated at degree m. Here we use the notation kx = (kxi)i H+ for x = (xi)i X+ and , m for the inner product truncated9at m.

Corollary 4.9 The discretized signature kernel k+ m over k truncated at degree m,

k+ m : X+ X+ R, k+ m(x, y) = S+(kx), S+(ky) m,

1. is a positive deﬁnite kernel,

2. k+ m(x, y) = Pm d=1 P 1 i1< <id<|x| 1 j1< <jd<|y|

Qd r=1 ir,jr k(x, y)

3. k+ m(x, y) = 1 + P |x|>i1 1 |y|>j1 1 i1,j1 k(x, y)

1 + (1 + P |x|>im>jm 1 |y|>jm>jm 1 im,jm k(x, y))

where we denote x = (xi)|x| i=1 , y = (yi)|y| i=1 X+ and

i,j k(x, y) := k(xi+1, yj+1) + k(xi, yj) k(xi, yj+1) k(xi+1, yj).

While above follows as simple corollary from the discussion so far, the identity Point (3) will be the key to an eﬃcient algorithm; it is clearly more eﬃcient than computing ﬁrst S+(x) and S+(y) and then their inner product; but it is also much more eﬃcient than the identity in Point (2) since it only uses i,j k(x, y) for each i, j once.

Remark 4.10 (Variations on a theme) Modiﬁcations of the above are possible. For example,

(i) One can omit the ﬁrst diﬀerences, that is consider the sequentialization

k+ : X+ X+ R; (x, y) 7 1 + X

m 1 1 i1< <im |x| 1 j1< <jm |y|

k=1 k(xik, yjk).

In feature space, this amounts to replacing the path (kt)t [0,1] by the path ( t 0 ks ds)t [0,1] and calculating the signature of the latter.

9. (s0, s1, . . .), (t0, t1, . . .) m := (s0, s1, . . . , sm, 0, . . .), (t0, t1, . . . , tm, 0, . . .)

Kernels for sequentially ordered data

(ii) More generally, one may replace for each m, the inner product Qm k=1 k(xik, yjk) by a kernel km : Xm Xm R. This leads to a sequentialization of the family of kernels (km)m given as

k+ : X+ X+ R; (x, y) 7 X

m 1 1 i1< <im |x| 1 j1< <jm |y|

km((xi1, . . . , xim), (yj1, . . . , yjm)).

This corresponds to choosing diﬀerent kernels on diﬀerent levels of the tensor algebra, e.g., re-normalization.

(iii) One may (additionally) opt to remember the position of elements in the sequence. This can be done by mapping sequences in X+ to a position-remembering sequence in (X N)+ ﬁrst, and then carrying out the sequentialization in (ii) for the family of kernels (km)m where km : (X N)+ (X N)+ R is deﬁned as

κ ((i1, . . . , im), (j1, . . . , jm)) km ((xi1, . . . , xim), (yj1, . . . , yjm)) .

and κ : N+ N+ R denotes an arbitrary positive deﬁnite kernel. Further, one can replace the strict inequalities < in the summation over 1 i1 < < im |x| and 1 j1 < < jm |y| by .

Further, Assumption 3.1, especially the continuity of k can be dropped and any of above sequentializations gives a kernel for sequences (but convergence to a kernel on paths is no longer guaranteed). There are many other modiﬁcations and above choices, while minor, seem somewhat arbitrary. The above have their justiﬁcation in explaining prior art, see Section 5 below.

5. Revisiting classic kernels for sequences: string, alignment and convolution kernels

In this section we show that the discretized signature kernel is closely related to the existing kernels for sequential data, in the sense that classic kernels for sequences arise as special cases of our approach, namely:

(a) String kernels Lohdi et al. (2002); Leslie and Kuang (2004) can be seen as a case of discretized signature kernels.

(b) The global alignment kernel Cuturi et al. (2007); Cuturi (2011) can be obtained from a special case of the discretized signature kernel, by deleting some smaller terms, in the process destroying positive deﬁniteness.

(c) The discretized signature kernel can be related to the framework of the relationconvolution kernel Haussler (1999) with the relation being a subsequence .

The results of this section are not important for using our method as presented in the previous and following section, but it might be of interest to researchers who are familiar with above kernels.

Kir aly and Oberhauser

5.1. The string kernel

The string kernel is a kernel for sequences in a ﬁnite set X; in this context, the elements of X are usually called letters and X is called the alphabet. Two important examples are the Latin alphabet X = {A, B, . . . , Z, a, b, . . . , z} for classic text mining, or X being DNA nucleotides, proteins, etc., in bioinformatics.

Deﬁnition 5.1 (Deﬁnition 1, page 423 of Lohdi et al. (2002)) Fix a ﬁnite set X and a parameter λ [0, 1]. The string kernel on X with parameter λ is deﬁned as

kstring : X+ X+ R (x, y) 7 X

w X+ Φw(x)Φw(y)

1 i1< <im r λ|im i1+1|1(xi1,...,xik)=w for w X+.

Proposition 5.2 Fix a ﬁnite set X and a parameter λ [0, 1]. Deﬁne a kernel

k : X X R, k(a, b) = 1a=b.

Then the sequentialization of k given in Remark 4.10 (i), equals the string kernel with parameter λ = 1 up to an additive constant 1, that is k+(x, y) = 1 + kstring(x, y) for all x, y X+.

Proof We apply Theorem 4 in the variation given in Remark 4.10 (i) to the kernel k to get

k+(x, y) = 1 + X

1 i1< <im |x| 1 j1< <jm |y|

r=1 k(xir, yjr) = 1 + X

1 i1< <im |x| 1 j1< <jm |y|

1xi1=yj1 1xim=yjm

1 i1< <im |x| 1 j1< <jm |y|

1(xi1,...,xim)=(yj1,...,yjm)

w X+,|w|=m Φw(x)Φw(y) = 1 + X

w X+ Φw(x)Φw(y) = 1 + kstring(x, y)

More generally, for any λ [0, 1] one can apply the κ-sequentialization given in Remark 4.10 (iii), to the kernel k(a, b) = 1a=b and

κ((i1, . . . , im), (j1, . . . , jm)) = λim i1+1λjm j1+1

to recover the string kernel kstring for the parameter λ; this follows as above. Similarly, various variants of the vanilla string kernel such as gappy kernels, Leslie and Kuang (2004) can be recovered10. Note that the +1 in kstring +1 = k+ arises since in string kernels one usually does not count the empty word (if we count the empty word then k+ = kstring). However, this is merely convention and of non-importance for the Gram matrix.

10. As one of the reviewers pointed out, a very general framework for sequences that encompasses string kernels are so-called rational kernels.

Kernels for sequentially ordered data

Remark 5.3 Implicit in above discussion is the identiﬁcation of a string formed from an alphabet X as a lattice path in H = R|X|, that is the string x = (ai1, . . . , ail) is identiﬁed as the lattice path in the vector space R|X| for which we identify X as an orthnomal basis (the free vector space over X),

x C1([0, l], H) x(t) = {t} ai t +

where t is the ﬂoor function of t, and {t} is the fractional part of t. We deal with nongeometric rough paths , see Remark 4.5.

5.2. The global alignment kernel

The global alignment kernel is one of the most used kernels for sequences. We give its deﬁnition in modern terminology (Section 2.2 of Cuturi (2011)).

Deﬁnition 5.4 Deﬁne

A(m, n) = π = (πi)k i=1 N2 + : k m + n 1, π1 = (1, 1), πk = (m, n),

(πi+1 πi) {(0, 1), (1, 0), (1, 1)}

We call an element of A(m, n) an alignment of (m, n).

Deﬁnition 5.5 For a set X and a kernel k : X X R, the global alignment kernel is deﬁned as

k GA : X+ X+ R (x, y) 7 X

π A(|x|,|y|) e Dx,y(π)

where Dx,y(π) = P|π| i=1 k(xπ1 i , yπ2 i ), and we denote with |x|, |y| the length of the sequences x, y and we denote the entries of an alignment π = (π1, . . . , πk) as πi = (π1 i , π2 i ) N2.

In its native form, the global alignment kernel cannot be written as a discretized signature kernel since these preserve positive deﬁniteness, while the global alignment kernel does not (see the discussion in Cuturi (2011)). However, a simple modiﬁcation turns the global alignment kernel into a positive deﬁnite kernel that can be obtained by sequentialization.

Deﬁnition 5.6 Deﬁne

2 (l) = {π = (π1, . . . , πk) N1 + : πk l, πi+1 πi {0, 1}}

and call an element of A 1

2 := S l 1 A 1

2 (l) a half-alignment.

Deﬁnition 5.7 Fix a set X and a kernel k : X X R. The global half-alignment kernel is deﬁned as

k GHA : X+ X+ R (x, y) 7 X

e Dx,y(π,ρ)

where Dx,y(π, ρ) = P|π| i=1 k(xπi, yρi).

Kir aly and Oberhauser

Proposition 5.8 Fix a set X and a positive deﬁnite kernel k : X X R. Then the global half-alignment kernel k GHA equals the κ-sequentialization of the kernel (x, y) 7 e k(x,y)

given in Remark 4.10(iii), that is k GHA(x, y) = k+(x, y) with κ(π, ρ) = 1π A 1

Proof The κ-sequentializtion yields

k+(x, y) = X

m 1 1 π1 πm |x| 1 ρ1 ρm |y|

κ((π1, . . . , πm), (ρ1, . . . , ρm))

r=1 k(xπr, yρr)

1 π1 πm |x| 1 ρ1 ρm |y|

1(π1,...,πm) A 1

2 1(ρ1,...,ρm) A 1

r=1 e k(xπr,yρr)

e Pm i=1 k(xπi,yρi) = k GHA(x, y).

Remark 5.9 The terms missing in k GA, when compared to k GHA are those arising from sequence pairs π N2 + in which there is an increment (0, 0). In view of the discussion in Section 3.2 of Cuturi (2011), these missing terms are exactly the locus of non-transitivity in the sense of Shin and Kuboyama (2008). Concerning the possible non positive deﬁniteness of k GA, one can follow Cuturi (2011) and study suﬃcient conditions under which the original global alignment kernel is positive deﬁnite.

5.3. The relation-convolution kernel

Finally, we would like to point out that the discretized signature kernels are closely related to the relation-convolution kernels, see Haussler (1999), Section 2.2.

Deﬁnition 5.10 Fix kernels ki : Xi Xi R for i = 1, . . . , m. Fix a relation R between the sets X1 Xm and X, that is a subset R X1 Xm X and say that (x1, . . . , xm) X1 Xm decomposes x X if (x1, . . . , xm, x) R. The relationconvolution kernel of (k1, . . . , km) is deﬁned as

k RC : X X R, (x, y) 7 X

(x1,...,xm):((x1,...,xm),x) R (y1,...,ym):((y1,...,ym),y) R

i=1 ki(xi, yi).

The intuition for above deﬁnition is that we can measure similarity of two objects x, y X by decomposing them into substructures and comparing these substructures.

Proposition 5.11 Let X be any set and k : X X R. Then the sequentialization k+ : X+ X+ R given in Remark 4.10 (i) is a relation-convolutional kernel for the relation

Kernels for sequentially ordered data

being an ordered subsequence . More precisely, deﬁne X1 = N, Xi = X for i = 2, . . . , m+1 and

k1(i, j) = 1i=j and ki(x, y) = k(x, y) for i = 2, . . . , m + 1.

R = {(l, x1, . . . , xm+1, x ) X1 Xm+1 X+ : l m and i1 < < il such that (x 1, . . . , x l) = (xi1, . . . , xil)}

the relation-convolutional kernel k RC equals k+.

The proof follows by spelling out the deﬁnitions of k RC and k+. It is interesting to note the native discretized signature kernel of Theorem 4 does not, at least naturally, fall into Haussler s convolutional kernel framework since it heavily relies on the additional diﬀerences k. (Above variation is only obtained after the summation step described in Remark 4.10).

6. Eﬃcient algorithms

In this section, we design eﬀective algorithms to evaluate or approximate the (n n)-Gram matrix k+ m(xi, xj)

i,j {1,...,n} of n sequences x1, . . . , xn X+. The results are summarized in Table 2. The key are the recursive Horner-type formula given in Corollary 4.9 together with ideas from dynamic programming and low-rank approximations.

method algorithm time storage naive evaluation Deﬁnition 4.8 O(n2 ℓ dim(H)m) O(n2 dim(H)m) dynamic programming Algorithm 3 O(n2 ℓ2 m) O(ℓ2) DP & LR, per-element Algorithm 4 O(n2 ℓ ρ m) O(ℓ ρ) DP & LR, per-sequence Section 6.3.1 O((n + ρ) ℓ2 ρ m) O(n ℓ2) DP & LR, simultaneous Algorithm 5 O(n ℓ ρ m) O(n ℓ ρ)

Table 2: Computational time (in elementary arithmetic operations) and storage cost for computing (an approximation) to the (n n)-Gram matrix k+ m(xi, xj)

i,j {1,...,n} between sequences x1, . . . , xn X+ of length at most ℓ, that is |xi| ℓfor i = 1, . . . , n. Methods: DP = dynamic programming, LR = low-rank. In the low-rank methods, ρ is a meta-parameter which controls accuracy of approximation.

Remark 6.1 Naive evaluation of k+ m(x, y) by directly calculating it as an inner product of signature features incurs an exponential storage cost O(dim(H)m). Even for H = Rd this becomes quickly very expensive and infeasible when dim(H) = which is the case for most practically relevant kernels.

Remark 6.2 Above algorithms are already powerful for H = Rd with the inner product in Rd as kernel k. In this case, k+ x S(kx) = S( kx, ) S(x) Q m 0 Rd m, that is the

feature map for a path x is the usual signature S(x) Q m 0 Rd m. Compared to naive computation of S(x), Algorithm 5 gives a massive reduction in complexity (from exponential to linear!).

Kir aly and Oberhauser

Remark 6.3 For readability we have absorbed in Table 2 the computational time c that is needed to evaluate k(x, y) for x, y X into the big O bound since this is just a multiplicative factor. E.g. the computational time for Algorithm 3 would read O(c n2 ℓ2 m) if we include c.

6.1. Notation and subroutines for computations with arrays (tensors)

We introduce notation (inspired by Python/Numpy) and fast algorithms for dynamic programming subroutines. This allows us to express our algorithms as operations on arrays alone and makes it easy to implement them in high-level languages that are optimized for operating on arrays (such as Numpy, etc).

Notation 6.4 (Arrays/tensors) We will denote the (i1, . . . , ik)-th element of a k-fold array (= degree k tensor) A by A[i1, . . . , ik]. Occasionally, for ease of reading, we will use | instead of , as a separator, for example A[i1, i2|i3, . . . , ik] if the indices on the left side of | are semantically distinct from those on the right side. The arrays all contain elements in R, and the indices will always be positive integers, excluding zero.

Notation 6.5 (functions applied elementwise) For a function f : R R, and an array A, we will denote by f(A) the array where f is applied element-wise. I.e., f(A)[i1, . . . , ik] = f(A[i1, . . . , ik]). Similarly, for f : Rm R and arrays A1, . . . , Am, we denote

f(A1, . . . , Am)[i1, . . . , ik] = f(A1[i1, . . . , ik], . . . , Am[i1, . . . , ik]).

For example, 1

2 A2 is the array A having all elements squared, then divided by two. The array A + B contains, element-wise, sums of elements of A and B.

Notation 6.6 (slice, shift, cumulative sum) Let A be a k-fold array of size (n1 nk).

(i) For an index ij (at j-th position), we will write A[:, . . . , :, ij, :, . . . , :] for (k 1)-fold array of size (n1 nj 1 nj+1 . . . nk) such that

A[:, . . . , :, ij, :, . . . , :][i1, . . . , ij 1, ij+1, . . . , ik] = A[i1, . . . , ik].

We deﬁne in analogy, iteratively, A[:, . . . , :, ij, :, . . . , :, ij , :, . . . , :], and so on. Arrays of this type are called slices (of A).

(ii) For an integer m, we will write A[:, . . . , :, +m, :, . . . , :] for the k-fold array of size (n1 nj 1 (nj + m) nj+1 nk) such that

A[:, . . . , :, ij, :, . . . , :][i1, . . . , ij 1, ij + m, ij+1, . . . , ik] = A[i1, . . . , ik],

and where non-existing indices of A are treated as zero. Arrays of this type are called shifted (versions of A). For negative m, the shifts will be denoted with a minus -sign instead of a plus -sign.

Kernels for sequentially ordered data

(iii) We will write A[:, . . . , :, , :, . . . , :], where is at the j-th position, for the k-fold array of size (n1 nk) such that

A[:, . . . , :, , :, . . . , :][i1, . . . , ik] =

κ=ij A[i1, . . . , ij 1, κ, ij+1, . . . , ik].

Arrays of this type are called slice-wise cumulative sums (of A).

(iv) We will write A[:, . . . , :, Σ, :, . . . , :], where Σ is at the j-th position, for the (k 1)- fold array of size (n1 nj 1 nj+1 nk) such that A[:, . . . , :, Σ, :, . . . , : ][i1, . . . , ik 1] = Pnj κ=1 A[i1, . . . , ij 1, κ, ij, . . . , ik 1]. Arrays of this type are called slicewise sums (of A).

We will further use iterations and mixtures of the above notation, noting that the indexwise sub-setting, shifting, and accumulation commute with each other. Therefore expressions such as A[+1, : |Σ, +2] or A[j| :, +3, ] are well-deﬁned, for example. We will also use the notation A[ + m, . . . ] to indicate the shifted variant of the cumulative sum array.

Cumulative sums can be computed eﬃciently, in the order of the size of an array, as opposed to squared complexity of more naive approaches. The algorithm is classical, we present it for the convenience of the reader in Algorithms 1 and 2.

Algorithm 1 Computing the cumulative sum of a vector. Input: A 1-fold array A of size (n) Output: The cumulative sum array A[ ]

1: Let Q A.

2: for κ = 2 to n do

3: Q[κ] Q[κ 1] + A[κ]

5: Return Q

Algorithm 2 Computing the cumulative sum of an array. Input: A k-fold array A of size (n1 nk) Output: The cumulative sum array A[ , . . . , , :, . . . , :] (up to the m-th index)

2: for κ = 2 to m do

3: Let Q Q[:, . . . , :, , :, . . . , :] (at the κ-th index), where the right side is computed via applying algorithm 1 slice-wise.

5: Return Q

Kir aly and Oberhauser

6.2. Computing the discretized signature kernel

Algorithm 3 Evaluation of k+ m. Input: Sequences x, y X+. A kernel k : X X R. A truncation degree m. Output: k+ m(x, y)

1: Set ℓx = |x| 1, ℓy = |y| 1

2: Compute the (ℓx ℓy)-array K with entries K[i, j] = i,j k(x, y)

3: Initialize an (m ℓx ℓy)-array A

4: Set A[1| :, :] K

5: for d = 2 to m do

6: Compute Q A[d 1| , ]

7: Set A[d| :, :] K (1 + Q[+1, +1])

9: Compute R 1 + A[m|Σ, Σ]

10: Return R

The correctness of above Algorithm follows directly from the recursive formula (3) in Corollary 4.9: we start by evaluating the innermost parenthesis and then use the summation to build the content of the next parenthesis. In symbols, deﬁne for d = 1, . . . , m 1 ,

A1 i,j := i,j k(x, y) and Ad+1 i,j := i,j k(x, y)

then k+ m(x, y) = 1 + P i 1 j 1 Am i ,j . In above Algorithm 3 we denote A[d|i, j] = Ad i,j.

Disregarding the cost of computing the (ℓx ℓy)-matrix ( i,j k(x, y))i,j, the computational cost of Algorithm 3 is O(m|x||y|) elementary arithmetic operations (= the number of loop elements) and O(m|x||y|) units of elementary storage. The storage requirement can be reduced to O(|x||y|) by discarding A[d 1| :, :] from memory after step 7 each time.

In each run of the loop, a matrix Q is pre-computed to avoid a ﬁve-fold loop that would be necessary with the more naive version of line 7,

A[m|i, j] A[m|i, j]

j >j A[m 1|i , j ]

that leads to a blown up computational cost of O(mℓ2 xℓ2 y) at the asymptotically insignificant gain of storing one (ℓx ℓy)-matrix less (the matrix Q).

Kernels for sequentially ordered data

6.3. Large scale strategies

By using Algorithm 3 one can compute the Gram matrix k+ m(xi, xj)

i,j=1,...,n of n sequences x1, . . . , xn X+ in O(n2 ℓ2 m) elementary arithmetic operations. While this is for moderate sizes of n,m and ℓachievable on contemporary desktop computers, it becomes quickly prohibitive for large n, ℓdue to the quadratic growth especially when combined with parameter tuning or cross-validation schemes (as later in our experiments). We present two low-rank techniques to address this: the ﬁrst is classic and reduces the quadratic cost in n to linear, the second reduces the quadratic cost in ℓto linear. Finally, we combine these two techniques to get O(n ℓ m ρ) computation time with ρ denoting an approximation parameter.

6.3.1. Low-rank methods for the sequence-vs-sequence kernel matrix

Since k+ m(xi, xj)

i,j=1,...,n is a Gram-matrix, it is directly amenable to large-scale variants of low-rank type. Strategies of this kind include the incomplete Cholesky decomposition, Nystr om approximation, or the inducing point formalism in a Gaussian process framework. For n sequences, all strategies mentioned above require evaluation of at most an (n r) and an (r r) matrix (where r is a meta-parameter), which costs O((r + n) r ℓ2 m) elementary operations and O((r + n) r ℓ2) storage11. Any of these low-rank strategies reduces the complexity of the Gram matrix to be linear in n, but not in ℓsince the strategy is completely independent of how the discretized signature kernel was evaluated at a given pair (xi, xj) X+ X+. For the same reason, any improvements on the cost of evaluating k+ m(xi, xj) will combine with this strategy.

6.3.2. Low-rank methods for the element-vs-element kernel matrix

The evaluation k+ m(xi, xj) for a pair (xi, xj) X+ is given by nested partial summations and multiplications applied to the (ℓj ℓj)-matrix (k(xi,r, xj,s))r,s where r = 1, . . . , ℓi, s = 1, . . . , ℓj and we denote xi = (xi,1, . . . , xi,ℓi), xj = (xj,1, . . . , xj,ℓj) X+. The low-rank methods of the previous paragraph cannot be applied naively to this (ℓi ℓj)- matrix: ﬁrstly, this matrix is in general non-symmetric; secondly, the summationand multiplication-type operations require in naive form at each recursion step access to the full (ℓi ℓj)-matrix. The ﬁrst issue is easily addressed by replacing the respective symmetric decomposition with the analogous non-symmetric one. The second issue is much harder to deal with, but below we show that by working exclusively on low-rank factorizations in each recursion step this issue can also be dealt with.

Deﬁnition 6.8 Let A be an (a b)-matrix. For U an (a r)-matrix, and V a (b r)-matrix, we say that (U, V ) is a low-rank presentation of A, of rank r, if

r =1 Ui,r Vj,r

With denoting the usual matrix multiplication this is equivalent to A = U V .

11. followed by a slightly modiﬁed variant of the learning algorithm itself which usually costs O((r + n) r2) elementary operations and storage (at most) instead of the unmodiﬁed variant which usually costs O(n3) (at most).

Kir aly and Oberhauser

The fact that A has a low-rank presentation of rank r does imply that A is of rank r or less (by equivalence of matrix rank with decomposition rank), but it does not imply that A is of rank exactly r. We state a number of straightforward but computationally useful Lemmas that show how low-rank matrices behave under shifting, summation and cumulative summation, componentwise addition and multiplication. This allows us to replace these operations on matrices in Algorithm 4 by the operations on low-rank matrices.

Lemma 6.9 (Low-rank under cumulative summations and shifts) Let (U, V ) be a low-rank representation of rank r for an (a b)-matrix A. Then

(i) Deﬁne A and A as A i,j = P i i Ai ,j and A i,j = P j j Ai,j . Deﬁne U as U i,r = P i i Ui ,r and V as V j,r = P j j Vj ,r. Then (U , V ) is a low-rank representation or rank r of A of rank and (U, V ) is a low-rank representation of rank r of A .

(ii) Deﬁne A and A as A i,j = Ai+s,j and A i,j = Ai,j+s for s 1. Deﬁne U and V as U i,r = Ui+s,r1i+s a and V j,r = Vj+s,r1j+s b. Then (U , V ) is a low-rank representation of A of rank r and (U, V ) is a low rank representation of rank r of A .

Proof This follows by spelling out the deﬁnition: for (i)

i i Ai ,j = X

r =1 Ui ,r Vj,r =

r =1 U i,r Vj,r ;

A i,j = Ai+s,j =

r =1 Ui+s,r Vj,r =

r =1 U i,r Vj,r .

The statements for A follow similarly.

Lemma 6.10 (Low rank under addition and multiplication) Let A1, A2 be (a b)- matrices such that (U, V ) is a low-rank representation of A1 of rank r1, and (R, S) is a low-rank representation of A2 of rank r2. Then

(i) (U , V ) is a low rank representation of A1 + A2 of rank r1 + r2 where

( Ui,r , if 1 r r1 Ri,r r1 , if 1 + r1 r r1 + r2 , V j,r =

( Vj,r , if 1 r r1 Sj,r r1 , if 1 + r1 r r1 + r2

(ii) ( U, V ) is a low rank representation of12 A1 A2 of rank r1r2 where

Ui,r = Ui,a Ri,b and Vj,r = Vj,a Sj,b

and a, b are such that r = (a 1) r2 + b 1 with a {1, . . . , r1} and b {1, . . . , r2}

12. We denote the elementwise product of matrices with A B, that is (A B)i,j = Ai,j Bi,j.

Kernels for sequentially ordered data

Proof Since the (i, j)-entry of the matrix A1+A2 equals Pr1 r=1 Ui,r Vj,r+Pr2 r =1 Ri,r Sj,r and the (i, j)-entry of A1 A2 equals Pr1 r=1 Ui,r Vj,r Pr2 r =1 Ri,r Sj,r , the result follows by substituting the two sums over r {1, . . . , r1} and r {1, . . . , r2} by one sum over {1, . . . , r1+r2} resp. over {1, . . . , r1r2}.

In view of Point (ii) of above Lemma we introduce new array notation:

Notation 6.11 Let A1 be a k-fold array of size (n1 nk) and A2 be a k-fold array of size (n1 nk 1 n k). Write A1 A2 for the (n1 nk 1 (nk n k)) array with entries

A1 A2[i1, . . . , ik] = A1[i1, . . . , ik 1, a]A2[i1, . . . , ik 1, b]

where a, b are such that ik = (a 1) n2 + b 1 with a {1, . . . , n1} and b {1, . . . , n2}.

Algorithm 4 Evaluation of k+ m, with low-rank speed-up. Input: Sequences x, y X+. A kernel k : X X R. A truncation degree m 1. An integer 1 r min(ℓx, ℓy). Output: An approximation to k+ m(x, y)

1: Set ℓx |x| 1 and ℓy |y| 1

2: Compute a low-rank representation (U, V ) of rank r, approximating the element-vselement matrix/array with entries K[i, j] = i,j k(x, y), i {1, . . . , ℓx}, j {1, . . . , ℓy}

3: Initialize an (m ℓx )-array B and an (m ℓy )-array C (* means that the size may change dynamically)

4: Set B[1| :, :] U and C[1| :, :] V

5: for d = 2 to m do

6: P B[d 1| +1, :] and Q C[d 1| +1, :]

7: Append an (ℓx 1)-array of 1 s to P and append an (ℓy 1)-array of 1 s to Q

8: Set B[d| :, :] (U P)[:, :] and C[d| :, :] (V Q)[:, :]

9: optional: simplify the low-rank presentation (B, C), reducing its rank

10: end for

11: Set R B[m|Σ, :] and S C[m|Σ, :]

12: Return 1 + (R S)[Σ]

Algorithm 4 ﬁrst approximates the element-vs-element (ℓx ℓy)-matrix K = ( i,j k(x, y))i,j by a low rank approximation (U, V ). It then uses the formula for k+ m given in Corollary 4.9 and updates the low-rank representation in each recursion step (starting from the innermost parenthesis):

The ﬁrst time line 6 is called it uses Lemma 6.9 to calculate a low-rank representation (P, Q) of the (ℓx ℓy)-matrix P i >i,j >j i ,j k(x, y)

i,j from the low-rank

approximation (U, V ) for K = ( i,j k(x, y))i,j; later calls of line 6 do the same but with K replaced by another matrix with low-rank factorization stored in B[d| :, :] and C[d| :, :].

Kir aly and Oberhauser

The ﬁrst time line 7 is called, it calculates a low-rank representation ( P, Q) of the (ℓx ℓy)-matrix (1 + (P Q )i,j)i,j by using (P, Q) and the trivial identity

k=1 Pi,k Qj,k =

k=1 Pi,k Qj,k

where P and Q are given by adding a (r + 1)-th column consisting of 1 s. Note that the rank increases by one by going from (P, Q) to ( P, Q). Later calls do the same but with (P, Q) replaced by another matrix with low-rank factorization stored in B[d| :, :] and C[d| :, :].

Line 8 calculates a low-rank representation ( U, V ) for the elementwise product between the (ℓx ℓy)-matrices K = U V and P Q . By Lemma 6.10 it is given as Ui, r = Ui,a Pi,b and V r,j = Va,j Qb,j.

Line 9 is optional, aiming at keeping the low-rank presentation small13.

Line 11 uses that

i 1,j 1 (R S )i,j = X

k Ri,k Sj,k = X

i 1 Ri,k)( X

(Recall that R S denotes elementwise multiplication of arrays R, S and R S denotes the usual matrix multiplication if R, S are interpreted as matrices).

The computational cost of Algorithm 4 is of the same order as the maximum size of B and C. That is, if ρ is the smallest integer such that at any time B requires ℓx m ρ space, and C requires ℓy m ρ space, then the computational complexity of Algorithm 4 is14 O((ℓx + ℓy) ρ m).

6.3.3. Simultaneous low-rank methods

Algorithm 4 yields an eﬃcient low-rank speed-up for computing a single entry k+ m(xi, xj). Thus total computational cost for the Gram matrix (k+ m(xi, xj))i,j {1,...,n} is O(n2 ℓ ρ m) and cost quadratic in the number of data points n may be prohibitive on large scale data. We address this by combining both low-rank strategies mentioned before to achieve a further reduction of computational cost from O(n2 ℓ ρ m) to O(n ℓ ρ m).

13. This can be achieved for example via singular value decomposition, sub-sampling, or random projection type techniques. 14. Since one can always choose a low-rank representation of B[m| :, :] and C[m| :, :] of rank min(ℓx, ℓy) or less, the computational complexity is bounded by O(ℓx ℓy m) (which is the complexity of Algorithm 3).

Kernels for sequentially ordered data

Algorithm 5 Computation of the Gram matrix of k+ m, with (double) low-rank speed-up. Input: Sequences x1, . . . , xn X+. A kernel k : X X R. A truncation degree m. An integer r 1. Output: A low-rank approximation (U, U) of the Gram matrix K = k+ m(xi, xj)

i,j=1,...,n.

1: Compute arrays U(i) such that each pair (U(i), U(j)) i, j {1, . . . , n} is a low-rank presentation of rank r for the ((|xi| 1) (|xj| 1))-matrix K(ij) with entries K(ij) a,b = a,b k(xi, xj)

2: Initialize an (n m )-array B (where * means that the sizes may change dynamically)

3: B[i|1| :, :] U(i) for all i {1, . . . , n}

4: for d = 2 to m do

5: Compute P B[: |d 1| +1, :]

6: Set κ, ρ such that (n κ ρ) is the size of P

7: Append an (n κ 1)-array of ones to P

8: B[: |d| :, :] B[: |1| :, :] P[:, :, :]

9: optional: simplify the low-rank presentation encoded in B, reducing its rank

10: end for

11: Compute U B[: |m|Σ, :]

12: Return U

Algorithm 5 is obtained as follows:

The previous Algorithm 4 transforms for any pair xi, xj X+ a low rank representation (U(ij), V (ij)) of the (ℓi ℓj)-matrix K(ij) = ( a,b k(xi,a, xj,b))a,b into a low-rank representation (R(ij), S(ij)) such that

k+(xi, xj) = 1 + X

R(ij) (S(ij))

But if U(ij) is independent of j and if V (ij) is independent i, then R(ij) will depend by Algorithm 4 only on i and S(ij) will depend only on j; hence R(ij) = S(ji). If we denote R(i) := R(ij) := S(ji), then in above Algorithm 5, B[i, m, a, r] equals R(i) a,r when the end of the for loop is reached, line 10.

k+ m(xi, xj) = 1 + X

R(i) (R(j))

a 1 R(i) a,k X

b 1 R(j) b,k

k=1 Ri,k Rj,k

where we denote with r the rank of (R(i), R(j)) and set Ri,k := P a 1 R(i) a,k and Ra,k := 1 if k = r +1. Hence, ( R, R) is a low-rank representation of the (n n)-Gram matrix K = (k+ m(xi, xj))i,j. In Algorithm 5, the array entry U[i, r ] equals Ri,r , line 11.

Kir aly and Oberhauser

Remark 6.12 Line 1 requires that for each pair of sequences xi = (xi,a)a=1,...,ℓi, xj = (xj,b)b=1,...,ℓj X+ we compute the (ℓi ℓj)-matrix K(ij) with entries K(ij) a,b = k(xi,a, xj,b). That is, the matrices U(i), when row-concatenated, should have low rank15. Jointly low-rank U(i) can be obtained by running a suitable joint diagonalization or singular value decomposition scheme on the matrices K(ij).

Remark 6.13 (Fast sequential kernel methods) By Section 5, fast string kernel methods such as the gappy, substitution, or mismatch kernels presented in Leslie and Kuang (2004) may be transferred to general sequential kernels. In general, this amounts to small modiﬁcation of Algorithm 3; for example, to obtain a gappy variant of the sequential kernel, summation in line 6 of Algorithm 3 over the whole matrix, of quadratic size, is replaced by summation over a linear part of it.

7. Experimental validation

We perform two experiments to validate the practical usefulness of the signature kernels:

(1) On a real world data set of hand movement classiﬁcation (eponymous UCI data set Sapsanis et al. (2013)), we show the discretized signature kernel outperforms the best previously reported predictive performance Sapsanis et al. (2013), as well as non-sequential kernel and aggregate baselines.

(2) On a real world data set on hand written digit recognition (pendigits), we show that the discretized signature kernel over the Euclidean kernel (= linear use of signature features) achieves only sub-baseline performance. Using the discretized signature kernel over a Gaussian kernel improves prediction accuracy to the baseline region.

We emphasize that our experiments do not constitute a systematic benchmark comparison to prior work, only validation that the signature kernel is a practically meaningful concept: experiment (1), shows that the sequentialization of standard kernels can achieve state-of-the-art performance on time series data; experiment (2) shows that even for paths in low dimensions, X = R2, using a non-linear static kernel for the sequentialization outperforms the linear kernel (the latter corresponds to learning with signature features of a 2-dimensional path; the former, corresponds to learning with signature features of the path lifted to the RKHS H of the non-linear kernel). A systematic benchmark comparison is likely to require a larger amount of work, since it would have to include a number of previous methods (multiple variants of the string and general alignment kernels, dynamic time warping, naive use of signatures), for most of which there is no freely available code with interface to a machine learning toolbox, and benchmark methods (order-agnostic baselines such as summary aggregation and chunking; distributional regression; naive baselines).

15. For example, if k is the Euclidean scalar product, U (i) can be taken as the raw data matrix, where rows are diﬀerent time points and columns are features.

Kernels for sequentially ordered data

7.1. Validation and prediction set-up

7.1.1. Prediction tasks

In all data sets, samples are multi-variate (time) series. All learning tasks are supervised classiﬁcation tasks of predicting class labels attached to series of equal length.

7.1.2. Prediction methods

For prediction, we use eps-support vector classiﬁcation (as available in the python/scikitlearn package) on the kernel matrices obtained from the following kernels:

(1.a) the Euclidean kernel k(x, y) = x, y . This kernel has no parameters.

(1.b) the Gaussian kernel k(x, y) = exp 1

2γ2 x y 2 . This kernel has one parameter, a scaling constant γ.

(2.a) the (truncated) discretized signature kernel k+ m over the linear/Euclidean kernel k(x, y) = γ x, y . This kernel has two parameters, a scaling constant γ, and the truncation degree m.

(2.b) the (truncated) discretized signature k+ m over the Gaussian kernel k(x, y) = θ exp 1

2γ2 x y 2 . This kernel has three parameters: scaling constants γ and θ, and truncation degree m.

(1.a) and (1.b) are considered standard kernels, (2.a) and (2.b) are discretized signature kernels. Note that the kernels (1.a) and (1.b) can only be applied to sequential data samples of equal length which is the case for the data sets considered. Even though (1.a), (1.b) may be applied to sequences of same length, they do not use any information about their ordering: both the Euclidean and the Gaussian kernel are invariant under (joint) permutation of the order of the indexing in the arguments. Another subtlety is that the discretized signature kernels (2.a), (2.b) do use information about the ordering of the sequences, but only for a truncation m 2. For m = 1, the kernel corresponds to choosing the increment/mean aggregate feature (Euclidean) or a type of distributional classiﬁcation (Gaussian). We will therefore explicitly compare truncation degrees 1 versus 2 and higher, to enable us to make a statement about whether using the order information was beneﬁcial (or not).

7.1.3. Tuning and error estimation

In all experiments, we use nested (double) cross-validation for parameter tuning (inner loop) and estimation of error metrics (outer loop). In both instances of cross-validation, we perform uniform 5-fold cross-validation. Unless stated otherwise, parameters are tuned on the tuning grid given in Table 3 (when applicable). Kernel parameters are the same as in the above section prediction mehods . The best parameter is selected by 5-fold cross-validation, as the parameter yielding the minimum test-f1-score, averaged over the ﬁve folds.

Kir aly and Oberhauser

parameter range kernel param. γ 0.01, 0.1, 1 kernel param. θ 0.01, 0.1, 1 truncation degree m 1,2,3 SVC regularizer 0.1, 1, 10, 100, 1000

Table 3: Tuning grid

7.1.4. Error metrics

The out-of-sample classiﬁcation error is reported as precision, recall, and f1-score of out-ofsample prediction on the test fold. Errors measures are aggregated with equal weights on classes and folds. These aggregates are reported in the result tables.

7.2. Experiment: Classifying hand movements

We performed classiﬁcation with the eps-support vector machine (SVC) on the hand movements data set from UCI Sapsanis et al. (2013). The ﬁrst database in the data set which we considered for this experiment contains, for each of ﬁve subjects (two male, three female) 180 samples of hand movement s EMG recordings. Each sample is a time series in two variables (channels) at 3.000 time points. The time series fall into six classes of distinct types of hand movement (spherical, tip, palmar, lateral, cylindrical, hook). For each subject, 30 samples of each class were recorded. Hence, for each subject, there is a total of 180 sequences in X3000 with X = R2. For each of the ﬁve subjects, we conducted the classiﬁcation experiment as described in Section 7.1, comparing prediction via SVC using one of the following kernels: (1.a) the Euclidean kernel, (1.b) the Gaussian kernel, (2.a) the sequentialized Euclidean kernel. For the non-sequential kernels (1.a), (1.b), prediction was performed with and without prior standardization of the data. For the sequential kernel, the tuning grid was considered in two parts: a truncation degree of m = 1, corresponding to mean aggregation, and truncation degrees of m = 2, 3, corresponding to the case where genuine sequence information is used. Further, we use the low-rank speed, Algorithm 5, to compute the Gram matrix. The results are reported in Tables 4 to 8. Jackknife standard errors (pooling the ﬁve folds) are all 0.04 or smaller. Baseline performance of an uninformed estimator is 1/6 0.17. One can observe that for all ﬁve subjects, SVC with sequential kernel of degree 2 or higher

method precision recall f1-score (1.a) linear 0.37 0.38 0.36 (1.a) linear, standardized 0.33 0.32 0.29 (1.b) Gaussian 0.57 0.59 0.56 (1.b) Gaussian, standardized 0.54 0.50 0.50 (2.a) mean aggregation 0.19 0.20 0.18 (2.a) sequential, degree 2 0.87 0.86 0.86

Table 4: female1.mat

outperforms SVC using any of the other kernels not using any sequence information. The sequence kernel outperforms the reported methods from the original paper Sapsanis et al. (2013) as well (Figures 11 and 12).

Kernels for sequentially ordered data

method precision recall f1-score (1.a) linear 0.47 0.39 0.37 (1.a) linear, standardized 0.31 0.28 0.27 (1.b) Gaussian 0.71 0.71 0.70 (1.b) Gaussian, standardized 0.59 0.58 0.56 (2.a) mean aggregation 0.18 0.20 0.18 (2.a) sequential, degree 2 0.94 0.97 0.95

Table 5: female2.mat

method precision recall f1-score (1.a) linear 0.48 0.46 0.46 (1.a) linear, standardized 0.47 0.42 0.43 (1.b) Gaussian 0.66 0.64 0.63 (1.b) Gaussian, standardized 0.54 0.51 0.50 (2.a) mean aggregation 0.26 0.23 0.20 (2.a) sequential, degree 2 0.96 0.96 0.96

Table 6: female3.mat

method precision recall f1-score (1.a) linear 0.37 0.33 0.33 (1.a) linear, standardized 0.38 0.36 0.36 (1.b) Gaussian 0.59 0.57 0.57 (1.b) Gaussian, standardized 0.53 0.54 0.53 (2.a) mean aggregation 0.20 0.18 0.17 (2.a) sequential, degree 2 0.96 0.96 0.96

Table 7: male1.mat

method precision recall f1-score (1.a) linear 0.36 0.33 0.32 (1.a) linear, standardized 0.37 0.29 0.27 (1.b) Gaussian 0.72 0.71 0.70 (1.b) Gaussian, standardized 0.34 0.39 0.35 (2.a) mean aggregation 0.22 0.23 0.20 (2.a) sequential, degree 2 0.93 0.93 0.93

Table 8: male2.mat

7.3. Experiment: Pendigits

We performed classiﬁcation on the pendgits data set from the UCI repository16. It contains 10992 samples of digits between 0 and 9 written by 44 diﬀerent writers with a digital pen on a tablet. One sample consists of a pair of horizontal and vertical coordinates of sampled at 8 diﬀerent time points, hence we deal with a sequence in X8 with X = R2. The data set comes with a pre-speciﬁed training fold of 7494 samples, and a test fold of 3498 samples. Estimation of the prediction error is performed in this validation split, while tuning is done as described via nested 5-fold cross-validation, inside the pre-speciﬁed training fold. We compared prediction via SVC using one of the following three kernels: (2.a) the sequentialized Euclidean kernel, and (2.b) the sequentialized Gaussian kernel. For both, the truncation degree was set to m = 4. The results are reported in Table 9. Jackknife standard errors (pooling the ﬁve folds) are all 0.01 or smaller. Baseline performance of an uninformed estimator is 1/10 0.10.

16. https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits

Kir aly and Oberhauser

method\method precision recall f1-score sequential, linear 0.91 0.90 0.89 sequential, Gaussian 0.97 0.97 0.97

Table 9: Pendigits

The quality of the SVC prediction with the sequentialzed linear kernel is comparable to results reported in Diehl (2013). It is outperformed by SVC prediction with the sequentialized Gaussian kernel. The latter performance is similar to the baseline performance of k-nearest neighbors reported in the documentation of the pendigits data set.

Acknowledgments

FK acknowledges support by the Alan Turing Institute, under EPSRC grant EP/N510129/1. HO is grateful for support by the Oxford-Man Institute of Quantitative ﬁnance and would like to thank Ilya Chevyrev for a careful proofreading and helpful remarks and discussions.

Appendix A. Signatures, and proofs of Theorem 1 and Theorem 3

A.1. Signatures

Deﬁnition A.1 Let E be a normed space and denote with L (H, E) the set of continuous linear maps from H to E. Given x C1([0, 1], H) and y C1 ([0, 1] , L (H, E)), the Riemann Stieltjes integral of y over [a, b] [0, 1] is deﬁned as the element in E given as

a y dx := lim mesh(π) 0

i=1 y(ti) (x(ti+1) x(ti))

where the limit is taken over all partitions π = {a t1 < < tl b} and mesh(π) := maxi=1,...,l 1 |ti+1 ti|. We also use the shorter notation y dx if the integration domain is clear from the context.

See (Lyons, 2004, Theorem 1.16) for a proof of a more general result. Applying the above with E being the subspace of Q m 0 H m of square summable tensors and the linear map being tensor multiplication, gives the series

S(x) := (Sm(x))m 0 := ˆ dx m

with dx 0 := 1 and dx (m+1) := dx m.

A.2. Proof of Theorem 1

This is more or less a folk theorem in rough path theory, but we recall it for readers not familiar with the subject. Point (1) of Theorem 1 follows since S(h) is itself a solution of a linear ODE driven by h and this solution map is continuous, see Lyons (2004). Point (2) is

Kernels for sequentially ordered data

more involved and we refer to Hambly and Lyons (2010). For the last point, Point (3), the key insight is that the series S(h) behaves like monomials.

Theorem 5 (Shuﬄe product) Let h C1([0, 1], H), then

ˆ dh m ˆ dh m = X

σ σ ˆ dh (m+m ) .

Here the sum is taken over all ordered shuﬄes OSm,m which are deﬁned as

σ : σ permutation of 1, . . . , m + m , σ (1) < < σ (m) , σ (m + 1) < < σ m + m .

and σ OSm,m acts on H (m+m ) as σ ei1 eim+m = eσ(i1) eσ(im+m ).

The proof of above is given in (Lyons, 2004, Theorem 2.29 (ii)). A direct consequence by applying Stone Weierstrass (in analogy to classic monomials) is that linear functionals of signatures approximate nonlinear functions of paths aribtrary well, thus showing Point (3) of Theorem 1.

A.3. Proof of Theorem 3 and Corollaries 4.3 and 4.4

Deﬁnition A.2 Denote with ℓ [a,b] the n-simplex over the interval [a, b], that is

ℓ [a,b] = {(ti)ℓ i=1 : a = t1 < < tℓ= 1},

and with [a,b] = S ℓ 2 ℓ [a,b]. If the interval [a, b] is clear from the context is clear, then we just write ℓand .

The following notation for sequences becomes useful.

Notation A.3 For n 1 denote

1. [n] := {1, . . . , n},

2. sequences in [n] as i = (i1, . . . , il) [n]l and call |i| = l the length of the sequence i,

3. d(i) := max{r : i1 = i2 = = ir} the number of repetitions of the ﬁrst element in the sequence i where by convention d(i) = 0 if i1 = i2,

4. i! := n1! nk! if i consists of k = |{i1, . . . , il}| diﬀerent elements in [n] and n1, . . . , nk denote the number of times they occur in i,

5. i [n] if i = (i1, . . . , il) [n]l and i1 < < il,

6. i [n] if i = (i1, . . . , il) [n]l and i1 il,

7. i d [n] if i [n] and no element in the sequence i appears more than than d times.

Kir aly and Oberhauser

Lemma A.4 Let h C1([a, b], H), π ℓ [a,b] and deﬁne hπ = (h(ti))i=1,...,ℓ H+. For any sequence i = (ik)k=1,...,m [ℓ]m with 1 i1 im ℓdenote i := m 1 k=1 [tik, tik+1]. Then

Sm(h) S+ m(hπ) = X

where the sum is taken over all i = (ik)m k=1 with 1 i1 im ℓthat have at least one repeating index, that is ik = ik+1 for at least one 1 k m 1. Moreover, each term in above sum can be bounded as follows

i i h|[ti,ti+1] 1.

where we use i! as introduced in Notation A.3.

Proof We can decompose into a disjoint union of i over all sequences i, with 1 i1 im ℓ, hence

Sm(h) = ˆ dh m = X

Split the sum P i into a sum over i that have an repeating element, i.e. ik = ik+1 for at least one k, and a sum P i over indices that do not have repeating elements. For the latter, the integration can be done explicitly and this sum equals X

with our usual notation ih = h(ti+1) h(ti). However, this is exactly S+ m(h), hence the ﬁrst statement follows. For the ﬁrst sum over sequences i with repeating elements, denote with i1, . . . , ik the distinct indices in i, and n1, . . . , nk the total counts of their respective occurrences. Then i = k j=1 j with j := nj [t ij, t ij+1]nj and it follows that

Therefore, we obtain

j dh nj H 1

i i h|[ti,ti+1] 1.

This is enough to prove our main approximation result:

Theorem 6 Let h C1([0, 1], H) let π ℓ. Then for every m 0

Sm(h) S+ m(hπ) H m gm

Kernels for sequentially ordered data

where (gm) are the coeﬃcients in the Taylor expansion around 0 of the function

m=1 gm zm := exp(z h 1)

1 + z h|[ti,ti+1] 1

S(h) S+(hπ) exp( h 1)

1 + h|[ti,ti+1] 1

If π is chosen such that h|[ti,ti+1] 1 = 1 ℓ 1 h 1 for all i = 1, . . . , ℓ 1, then

S(h) S+(h) exp( h 1)

1 + h ℓ 1 1 (ℓ 3)!

Proof The bound for Sm(h) S+ m(hπ) H m follows from Lemma A.4 by explicitly writing out the coeﬃcient gm. Applying the triangle inequality and using this bound and one obtains

S(h) S+(hπ)

m=1 Sm(h) S+ m(hπ) H m

m=1 gm = g(1)

Finally, for the special choice of a uniform partition π

g(1) = exp h 1 1 + h 1

to which we apply Euler s approximation theorem, Theorem 7, below using that h|[a,b] 1 + h|[b,c] 1 = h|a,c 1.

The ﬁrst part of Theorem 6 implies Theorem 3, the second part implies Corollary 4.4. For readers convenience we recall Euler s well-known approximation to the exponential function.

Theorem 7 Let x R, n N. Then,

n exp(x) = g(x, n), with |g(x, n)| exp(x)

In particular, it holds that

n = exp(x),

where convergence is uniform of order O(n 1) on any compact subset of R.

Proof All statements follow from the ﬁrst, which we proceed to prove. By the binomial theorem, it holds that 1 + x

Kir aly and Oberhauser

From the deﬁnition of the binomial coeﬃcient and an elementary computation, one obtains

k! + g(x, n, k), where g(x, n, k) xk

for k n. For k n, one has xk

Putting together all inequalities and using the Taylor expansion of exp yields the claim.

To prove Corollary 4.3 we need to generalize Euler s approximation of exp x to the case when x is not divided into uniform parts, but potentially very unbalanced parts. This is done in the proposition below

Proposition A.5 Let x R, n N, x 0. Let x1, . . . , xℓ R, xi 0 for i = 1, . . . , ℓsuch that Pℓ i=1 xi = x. Then,

i=1 (1 + xi) + g(x, x1, . . . , xℓ), where 0 g(x, x1, . . . , xℓ) x exp(x) max i=1,...,ℓxi.

In particular, it holds that

lim maxi xi 0

i (1 + xi) = exp(x),

where convergence is uniform of order O(maxi xi) on any compact subset of [0, ).

Proof All statements follow from the ﬁrst, which we proceed to prove. We use Notation A.3 Writing out the product, we obtain

i=1 (1 + xi) = X

where abbreviatingly we have written xi := Q i i xi. The Taylor expansion of the exponential on the other hand yields

exp(x) = exp

Note the major diﬀerent between both sums above being the repeating indices which may occur in the expansions of exp(x). More precisely, we obtain

i=1 (1 + xi) = X

Kernels for sequentially ordered data

We further split up the sum by length of i:

i=1 (1 + xi) =

i [ℓ] |i|=m i!>1

Positivity of g follows from this equation and positivity of x. Now consider the map φ which removes the ﬁrst duplicated index in an ordered index sequence i yielding a sequence of length |i| 1. On sequences of length m, the map φ is at most m-to-one, and surjective onto sequences of length m 1. Therefore,

i [ℓ] |i|=m i!>1

i [ℓ] |i|=m 1

where X = maxi xi. Thus,

i [ℓ] |i|=m i!>1

i [ℓ] |i|=m

Comparing to the expansion of exp(x) above, one observes that the right hand side is equal to

m! = X x exp(x).

The bounds in Proposition A.5 are worse than those from Euler s classic approximation result to the exponential in the case of equal xi, by a factor of x. This is due to the fact that the bound also needs to be valid for heavily imbalanced partitions of x into xi.

Appendix B. Higher order signature kernels and noisy observations

The discretized signature kernel was deﬁned as inner product

k+ : X+ X+ R k+(x, y) = S+(kx), S+(ky)

of features S+ that approximate the signature S, thus k+ will approximate the signature kernel k : PX PX R k (x, y) = S(kx), S(ky)

when the sequences are discretizations of paths: in Section 4 we showed under Assumption 3.1 that

k+(xπ, yπ ) k (x, y) as max(mesh(π), mesh(π )) 0.

Kir aly and Oberhauser

For essentially all practically relevant kernels k : X X R Assumption 3.1 holds whever the underlying paths x, y are of bounded variation. However, a common situation is that observations are perturbed by noise which are generically of unbounded variation.

Example B.1 Let X = R and k(x, y) = x, y R (that is H = R). If x C1([0, 1], X) and B = (Bt)t [0,1] is a Brownian motion in X then S(kx+B) S(x + B) = d(x + B) m is not well-deﬁned as a Riemann Stieltjes integral.

A general strategy is to replace the signature map x 7 S(x) by a a map from paths to Q m 0 H m as follows: construct a sequence of bounded variation paths (xn) that approx-

imation x and such that t 7 t 0 dx m n converges to a Q m 0 H m-valued path which is called the (geometric) rough path lift of x.

Example B.2 Constructions for mapping a path to an element of Q m 0 H m are known for a wide range of stochastic processes; below we mention a few

Brownian motion (leading to p-rough paths for any p > 2 )

more generally, continuous Semimartingale (leading to p-rough paths for any p > 2),

fractional Brownian motion of Hurst parameter H > 1

more generally, Gaussian processes (leading to p-rough paths where p depends on the regulary of the covariation process),

Markov processes in continuous time (leading to p-rough paths with p depending on the generator of the Markov process).

For details and more examples see Lyons (2004).

The notion of a geometric rough path allows to study classes of much rougher paths by the same approach we developed for the bounded variation case: below we provide the needed modiﬁcations for obtaining a sequentialized kernel such that k+(xπ, yπ ) is still well-deﬁned as the mesh vanishes when x, y are paths of unbounded variation but a rough path lift exists (e.g. all the stochastic processes in Example B.2)

Remark B.3 A more radical approach is to assume that the we are not only given paths as data but also the ﬁrst levels of their rough path lift , see for example Crisan et al. (2013). The same recursion applies to give a kernel, however, we do not spell out details since all benchmark data sets we are aware of, only provide path increments.

Deﬁnition B.4 Let d N. We call the map

S+ (d) :H+ Y

m 0 H m, (hi)ℓ i=1 7

the discretized signature map of degree d. We denote with S+ (d,m) the projection of S+ (d) to Lm n=0 H n.

Kernels for sequentially ordered data

Deﬁnition B.5 Let k : X X R and d, m 1, d m. The discretized signature kernel over k of order d and at degree m is deﬁned as

k+ (d,m)(x, y) : X+ X+ R, k+ (d,m) (x, y) = S+ (d,m) (kx) , S+ (d,m) (ky) .

The sequentialization k+ m from Section 4 arises as special case of above, more general, deﬁnition: k+ m = k+ (d,m) for d = 1, m N.

Readers familiar with rough paths will notice that the choice m = d = p recovers the notion of (kx(t))t [0,1] lifted to a geometric p-rough path. However, for machine learning applications it is often beneﬁcial to take d = p to ensure convergence but to consider m > d and ﬁnd the optimal degree m from the data (e.g. by cross-validation).

In analogy to the order d = 1 approximations discussed in Section 4, the central mathematical identity is now

(x(ti+1) x(ti)) m

r=1 (x(tir+1) x(ti)) ˆ dx m

where we use Notation A.3.

Proposition B.7 For x, y X+, d, m 1, m d,

(a) S+ (d,m)(kx) = 1 + Pm n=1 P i d[|x| 1] |i|=n

1 i! Q|i| r=1 i kx,

(b) k+ (d,m)(x, y) = 1 + Pm n=1 P i d[|x| 1] j d[|y| 1] |i|=|j|=n

1 i!j! Q|i| r=1 ir,jr k(x, y).

where i kx := kxi+1 kxi H.

Proof By deﬁnition

S+ (d)(kx) =

1 n!( r kx) n.

Application of the (non-commutative) associative law yields

1 n!( r kx) n = X

r=1 ir kx .

Using the analogous expression for S+ (d)(y), taking the scalar product while noting

r =1 kjr = δ|i|,|j|

r=1 kir, jr ky .

and truncating at tensor degree m and restricting to i d [|x| 1] and j d [|y| 1] yields the claim.

Kir aly and Oberhauser

Remark B.8 The appearance of the i! term together with i d [|x| 1] makes a recursion formula more complex since one needs to keep track how many elements in i are equal in every recursion step. We give an eﬀective algorithm in Section B.1 that relies on multi-way recursion.

B.1. Computing the higher order discretized signature kernel

Algorithm 6 Evaluation of k+ (d,m). Input: Sequences x, y X+. A kernel k : X X R. A truncation degree m. An approximation order d with 1 d m. Output: k+ (d,m)(x, y)

1: Let ℓx |x| 1 and ℓy |y| 1

2: Compute an (ℓx ℓy) array K such that K[i, j] = i,j k(x, y)

3: Initialize an (m d d ℓx ℓy)-array A, all entries zero

4: for n = 2 to m do

5: d min(d, n 1)

6: A[n|1, 1| :, :] K (1 + A[n 1|Σ, Σ| +1, + 1])

7: for r = 2 to d do

8: A[n|r, 1| :, :] 1

r K A[n 1|r 1, Σ| :, + 1]

9: A[n|1, r| :, :] 1

r K A[n 1|Σ, r 1| +1, :]

10: for s = 2 to d do

11: A[n|r, s| :, :] 1

rs K A[n 1|r 1, s 1| :, :]

12: end for

13: end for

14: end for

15: Compute R 1 + A[m|Σ, Σ|Σ, Σ]

16: Return R

Recall that multiplications of arrays in Algorithm 6 are entry-wise (not matrix multiplications). At the end of Algorithm 6, the array A contains as elements A[m|r, s|i, j] the contributions from sub-sequences i [ℓx], j [ℓy], beginning at i and j, with start-sequences iii . . . of length r and jjj . . . of length s, and of total length at most m.

Proposition B.9 Let x, y X+, m 1 and d with 1 d m. Deﬁne ℓx = |x| 1, ℓy = |y| 1 and

An,r,s i,j := X

i d[ℓx] j d[ℓy] |i|=|j|=n

κ=1 iκ,jκ k(xiκ, yjκ)

for every r, s 1 and i [ℓx], j [ℓy] where the sum over i = (i1, . . . , in), j = (j1, . . . , jn) is additionally restricted in the following way:

i = i1 = i2 = = ir = ir+1 and j = j1 = j2 = = js = js+1.

Kernels for sequentially ordered data

By convention, set An,r,s i,j = 0 if r < n or s < n. Then,

k+ (d,m)(x, y) = 1 +

s=1 An,r,s i,j (5)

and the following equalities hold

An,r,s i,j = 1

rs i,j k(x, y) An 1,r 1,s 1 i,j , (6)

An,1,1 i,j = i,j k(x, y)

s=1 An 1,r,s i ,j

An,r,1 i,j = 1

r i,j k(x, y) X

s=1 An 1,r 1,s i,j , (8)

An,1,s i,j = 1

s i,j k(x, y) X

r=1 An 1,r,s 1 i ,j , (9)

for r, s 2, i [ℓx], j [ℓy].

Proof By Proposition B.7

k+ (d,m)(x, y) = 1 +

i d[ℓx] j d[ℓy] |i|=|j|=n

r=1 ir,jr k(x, y)

i d[ℓx] j d[ℓy]

1 i!j! i1,j1 k(x, y) (1 + i2,j2 k(x, y) (1 + (1 + im,jm k(x, y))) )

and the ﬁrst equality shows the identity (5) by explicitly summing over that starting point of sequences and the number of repeated elements. Equality (6) follows directly by deﬁnition of An,r,s i,j and that for i = (i1, i2, . . . , in) with d(i) = r and i = (i2, . . . , in) we have i! = 1

r(i !). Equality (7) follows since

{i = (i1, . . . , in) : d(i) = 0, i1 = i} = {(i, i ) : i = (i 1, . . . , i n 1), i 1 > i}

r=0 {(i, i ) : i = (i 1, . . . , i n 1), i 1 > i, d(i ) = r}

where in above equalities we implicitly additionally assume that all indices i, i are increasing (that is, i1 in, i 1 i n 1). Equalities (8),(9) follow similarly.

Remark B.10 The computational cost of Algorithm 6 is O(d2m|x||y|) elementary arithmetic operations (= the number of loop elements) and O(d2|x||y|) units of elementary storage (when freeing up space for array entries directly after the last time they are read out).

Remark B.11 The higher order Algorithm 6 can be combined with the low-rank techniques analogous to the treatment of the order d = 1 algorithm in Section 6 by applying the lowrank representation to the matrices/2D-arrays A[m|i, j| :, :]. Since all assignments and manipulations can be re-phrased in those matrices, the same strategy applies.

Claus Bahlmann, Bernard Haasdonk, and Hans Burkhardt. Online handwriting recognition with support vector machines-a kernel approach. In Frontiers in handwriting recognition, 2002. proceedings. eighth international workshop on, pages 49 54. IEEE, 2002.

Dan Crisan, Joscha Diehl, Peter Friz, and Harald Oberhauser. Robust ﬁltering: multdimensional noise and multidimensional observation. Annals of Applied Probability, 23(5): 2139 2160, 2013.

Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American mathematical society, 39(1):1 49, 2002.

Marco Cuturi. Fast global alignment kernels. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 929 936, 2011.

Marco Cuturi, J-P Vert, Oystein Birkenes, and Takashi Matsui. A kernel for time series based on global alignments. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 2, pages II 413. IEEE, 2007.

Joscha Diehl. Rotation invariants of two dimensional curves based on iterated integrals. Co RR, abs/1305.6883, 2013. URL http://arxiv.org/abs/1305.6883.

Toni Giorgino. Computing and visualizing dynamic time warping alignments in r: the dtw package. Journal of statistical Software, 31(7):1 24, 2009.

Ben Hambly and Terry J Lyons. Uniqueness for the signature of a path of bounded variation and the reduced path group. Annals of Mathematics, 171(1):109 167, 2010.

David Haussler. Convolution kernels on discrete structures. Technical report, Citeseer, 1999.

Joseph B Kruskal. An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM review, 25(2):201 237, 1983.

Christina Leslie and Rui Kuang. Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 2004.

Daniel Levin, Terry J Lyons, Hao Ni, et al. Learning from the past, predicting the statistics for the future, learning an evolving system. ar Xiv preprint ar Xiv:1309.0260, 2013.

Huma Lohdi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classiﬁcation using string kernels. Journal of Machine Learning Research, 2002.

Terry J Lyons. Diﬀerential equations driven by rough paths. Springer Berlin Heidelberg New York, 2004.

Terry J Lyons, Hao Ni, and Harald Oberhauser. A feature set for streams and an application to high-frequency ﬁnancial tick data. In Proceedings of the 2014 International Conference on Big Data Science and Computing, page 5. ACM, 2014.

Hiroshi Shimodaira Ken-ichi Noma. Dynamic time-alignment kernel in support vector machine. Advances in neural information processing systems, 14:921, 2002.

Anastasia Papavasiliou, Christophe Ladroue, et al. Parameter estimation for rough diﬀerential equations. The Annals of Statistics, 39(4):2047 2073, 2011.

Hiroaki Sakoe. Two-level dp-matching a dynamic programming-based pattern matching algorithm for connected word recognition. Acoustics, Speech and Signal Processing, IEEE Transactions on, 27(6):588 595, 1979.

Hiroaki Sakoe and Seibi Chiba. A similarity evaluation of speech patterns by dynamic programming. In Nat. Meeting of Institute of Electronic Communications Engineers of Japan, page 136, 1970.

Christos Sapsanis, George Georgoulas, and Anthony Tzes. EMG based classiﬁcation of basic hand movements based on time-frequency features. In Control & Automation (MED), 2013 21st Mediterranean Conference on, pages 716 722. IEEE, 2013.

Bernhard Sch olkopf and Alexander J Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT press, 2002.

John Shawe-Taylor and Nello Cristianini. Kernel methods for pattern analysis. Cambridge university press, 2004.

Kilho Shin and Tetsuji Kuboyama. A generalization of Haussler s convolution kernel: mapping kernel. In Proceedings of the 25th international conference on Machine learning, pages 944 951. ACM, 2008.

Weixin Yang, Lianwen Jin, and Manfei Liu. Character-level chinese writer identiﬁcation using path signature feature, dropstroke and deep CNN. abs/1505.04922, 2015. URL http://arxiv.org/abs/1505.04922.