Journal of Machine Learning Research 17 (2016) 1-28 Submitted 3/14; Revised 5/16; Published 9/16

Characteristic Kernels and Infinitely Divisible Distributions

Yu Nishiyama (ynishiyam@gmail.com), The University of Electro-Communications, 1-5-1 Chofugaoka, Chofu, Tokyo 182-8585, Japan
Kenji Fukumizu (fukumizu@ism.ac.jp), The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan

Editor: Ingo Steinwart

Abstract

We connect shift-invariant characteristic kernels to infinitely divisible distributions on R^d. Characteristic kernels play an important role in machine learning applications, since their kernel means distinguish any two probability measures. The contribution of this paper is twofold. First, we show, using the Lévy–Khintchine formula, that any shift-invariant kernel given by a bounded, continuous, and symmetric probability density function (pdf) of an infinitely divisible distribution on R^d is characteristic. We also mention some closure properties of such characteristic kernels under addition, pointwise product, and convolution. Second, in developing various kernel mean algorithms, it is fundamental to compute the following values: (i) kernel mean values m_P(x), x ∈ X, and (ii) kernel mean RKHS inner products ⟨m_P, m_Q⟩_H, for probability measures P, Q. If P, Q, and the kernel k are Gaussian, then the computation of (i) and (ii) results in Gaussian pdfs, which are tractable. We generalize this Gaussian combination to more general cases in the class of infinitely divisible distributions. We then introduce a conjugate kernel and a convolution trick, so that the above (i) and (ii) have the same pdf form, expecting tractable computation at least in some cases. As specific instances, we explore α-stable distributions and a rich class of generalized hyperbolic distributions, which includes the Laplace, Cauchy, and Student's t distributions.
Keywords: Characteristic Kernel, Kernel Mean, Infinitely Divisible Distribution, Conjugate Kernel, Convolution Trick

© 2016 Yu Nishiyama and Kenji Fukumizu.

1. Introduction

Let (X, B(X)) be a measurable space and M1(X) be the set of probability measures on it. Let H be the real-valued reproducing kernel Hilbert space (RKHS) associated with a bounded and measurable positive-definite (p.d.) kernel k : X × X → R. In machine learning, kernel methods provide a technique for developing nonlinear algorithms by mapping data X_1, ..., X_n in X to higher- or infinite-dimensional RKHS functions k(·, X_1), ..., k(·, X_n) in H (Schölkopf and Smola, 2002; Steinwart and Christmann, 2008). Recently, an RKHS representation of a probability measure P ∈ M1(X), called the kernel mean, m_P := E_{X∼P}[k(·, X)] ∈ H (Smola et al., 2007; Fukumizu et al., 2013), or equivalently,

m_P(x) = ∫ k(x, y) dP(y), x ∈ X, (1)

has been used to handle probability measures in RKHSs. The kernel mean enables us to introduce a similarity and a distance between two probability measures P, Q ∈ M1(X), via the RKHS inner product ⟨m_P, m_Q⟩_H and the norm ||m_P − m_Q||_H, respectively. Using these quantities, different authors have proposed many algorithms, including density estimation (Smola et al., 2007; Song et al., 2008; McCalman et al., 2013), hypothesis testing (Gretton et al., 2008, 2012; Fukumizu et al., 2008), kernel Bayesian inference (Song et al., 2009, 2010, 2011, 2013; Fukumizu et al., 2013; Kanagawa et al., 2016; Nishiyama et al., 2016), classification (Muandet et al., 2012), dimension reduction (Fukumizu and Leng, 2012), and reinforcement learning (Grünewälder et al., 2012; Nishiyama et al., 2012; Rawlik et al., 2013; Boots et al., 2013). In these applications, the characteristic property of a p.d.
kernel is important: a p.d. kernel is said to be characteristic if any two probability measures P, Q ∈ M1(X) can be distinguished by their kernel means m_P, m_Q ∈ H (Fukumizu et al., 2004; Sriperumbudur et al., 2010, 2011). For a continuous, bounded, and shift-invariant p.d. kernel on R^d with k(x, y) = κ(x − y), a necessary and sufficient condition for the kernel to be characteristic is known via the Bochner theorem (Sriperumbudur et al., 2010, Theorem 9). As the first contribution of this paper, we show, using the Lévy–Khintchine formula (Sato, 1999; F. W. Steutel, 2004; Applebaum, 2009), that if κ is a continuous, bounded, and symmetric pdf of an infinitely divisible distribution P on R^d, then k is a characteristic p.d. kernel. We call such kernels convolutionally infinitely divisible (CID) kernels. Examples of CID kernels are given in Example 3.4. In addition, we note some closure properties of the CID kernels with respect to addition, pointwise product, and convolution.

To describe the second contribution, we briefly explain what is essentially computed in kernel mean algorithms. In general kernel methods, the following computations are fundamental:

(i) RKHS function values: f(x) for f ∈ H, x ∈ X,
(ii) RKHS inner products: ⟨f, g⟩_H for f, g ∈ H.

If f ∈ H is represented by f := Σ_{i=1}^n w_i k(·, X_i), w ∈ R^n, then the function value (i) f(x) = Σ_{i=1}^n w_i k(x, X_i) reduces to evaluations of the kernel k(x, y). Similarly, if two RKHS functions f, g ∈ H are represented by f := Σ_{i=1}^n w_i k(·, X_i) and g := Σ_{j=1}^l w̃_j k(·, X̃_j), respectively, then the inner product (ii) ⟨f, g⟩_H = Σ_{i=1}^n Σ_{j=1}^l w_i w̃_j k(X_i, X̃_j) also reduces to evaluations of the kernel k(x, y); this is the so-called kernel trick ⟨k(·, x), k(·, y)⟩_H = k(x, y). We consider a more general case in which f, g ∈ H are represented by f := Σ_{i=1}^n w_i m_{P_i} and g := Σ_{j=1}^l w̃_j m_{Q_j}, respectively, where {m_{P_i}}, {m_{Q_j}} ⊂ H are kernel means of probability measures {P_i}, {Q_j} ⊂ M1(X). Kernel algorithms involving kernel means use this type of RKHS function explicitly or implicitly.
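As a concrete sketch of how (i) and (ii) reduce to kernel evaluations, the following minimal NumPy code uses a Gaussian kernel as an arbitrary choice; all function names here are illustrative, not from the paper:

```python
import numpy as np

def gauss_kernel(x, y, gamma=1.0):
    """Shift-invariant Gaussian p.d. kernel k(x, y) = exp(-||x - y||^2 / (2 * gamma^2))."""
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * gamma ** 2)))

def rkhs_eval(w, X, x, gamma=1.0):
    """(i) f(x) = sum_i w_i k(x, X_i): function values reduce to kernel evaluations."""
    return sum(wi * gauss_kernel(x, xi, gamma) for wi, xi in zip(w, X))

def rkhs_inner(w, X, v, Y, gamma=1.0):
    """(ii) <f, g>_H = sum_i sum_j w_i v_j k(X_i, Y_j) for f = sum_i w_i k(., X_i),
    g = sum_j v_j k(., Y_j): the kernel trick."""
    G = np.array([[gauss_kernel(xi, yj, gamma) for yj in Y] for xi in X])
    return float(w @ G @ v)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
w, v = rng.normal(size=3), rng.normal(size=4)
```

With singleton expansions this recovers ⟨k(·, x), k(·, y)⟩_H = k(x, y) exactly.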
If {P_i}, {Q_j} are delta measures {δ_{X_i}}, {δ_{X̃_j}},^1 then these functions specialize to the kernel trick case above, since m_{δ_x} = k(·, x). The computation of (i) f(x) = Σ_{i=1}^n w_i m_{P_i}(x) and (ii) ⟨f, g⟩_H = Σ_{i=1}^n Σ_{j=1}^l w_i w̃_j ⟨m_{P_i}, m_{Q_j}⟩_H requires the following kernel mean evaluations:^2

(iii) kernel mean values: m_P(x) for P ∈ M1(X), x ∈ X,
(iv) kernel mean inner products: ⟨m_P, m_Q⟩_H for P, Q ∈ M1(X).

1. A probability measure δ_x(·), x ∈ X, is a delta measure: δ_x(B) = 1 if x ∈ B, and δ_x(B) = 0 otherwise, for B ∈ B(X).
2. If the kernel means m_P, m_Q are both expressed by weighted sums m_P := Σ_{i=1}^{n_P} η_i k(·, X̄_i) and m_Q := Σ_{j=1}^{n_Q} η̃_j k(·, X̃_j), with {X̄_i}, {X̃_j} ⊂ X, then the computation also reduces to the kernel trick case above.

Note that the kernel mean value (1) and the kernel mean inner product ⟨m_P, m_Q⟩_H = ∫∫ k(x, y) dP(x) dQ(y) involve integrals, and their exact computation is not tractable in general. The second contribution of this paper is to provide some classes of p.d. kernels and parametric models P, Q ∈ P_Θ := {P_θ | θ ∈ Θ} such that the computation of (iii) and (iv) reduces to a kernel evaluation, for which tractable computation can be considered. For a shift-invariant kernel k(x, y) = κ(x − y), x, y ∈ R^d, as shown in Lemma 2.5, the computation of (iii) and (iv) reduces to the following convolutions:

(iii) kernel mean values: m_P(x) = (κ ∗ P)(x),
(iv) kernel mean inner products: ⟨m_P, m_Q⟩_H = (κ ∗ P̃ ∗ Q)(0) = (κ ∗ P ∗ Q̃)(0),

where P̃ and Q̃ are the duals of P and Q, respectively.^3 This convolution representation motivates us to explore a set of parametric distributions P_Θ that is closed under convolution, namely, a convolution semigroup (P_Θ, ∗) ⊂ M1(R^d), where κ is a density function in P_Θ. To illustrate the basic idea, let us consider Gaussian distributions P_Θ as a parametric class, which is closed under convolution, and a Gaussian kernel. For simplicity, we consider the case of scalar variance matrices σ²I_d.
Let N_d(µ, σ²I_d) and f_d(x | µ, σ²I_d) denote the d-dimensional Gaussian distribution with mean µ and variance-covariance matrix σ²I_d, and its pdf, respectively. If P and Q are Gaussian distributions N_d(µ_P, σ_P²I_d) and N_d(µ_Q, σ_Q²I_d), respectively, and k is given by the pdf f_d(x − y | 0, τ²I_d), it is easy to see that m_P(x) = f_d(x | µ_P, (σ_P² + τ²)I_d) and ⟨m_P, m_Q⟩_H = f_d(µ_P | µ_Q, (σ_P² + σ_Q² + τ²)I_d). The kernel mean value and inner product thus reduce to simply evaluating Gaussian pdfs, whose parameters are updated following a specific rule. This type of computation appears in various applications. To list a few: Muandet et al. (2012) proposed support measure classification by considering kernels k(P, Q) between two input probability measures P, Q, including Gaussian models; Song et al. (2008) and McCalman et al. (2013) considered an approximation of a (target) probability measure P by a Gaussian mixture P_θ via the optimization problem θ̂ = argmin_θ ||m_P − m_{P_θ}||²_H. The parametric expression of (iii) and (iv) is especially useful for the optimization of θ in the class of distributions. Other such applications are given in Section 5.

We generalize this closedness or conjugacy^4 of Gaussians with respect to kernel means and explore other cases among CID kernels. We then introduce a kernel k conjugate to parametric models P_θ and a convolution trick, so that (iii) and (iv) have the same density form, i.e., there is some parameter update within the class. If P, Q are delta measures δ_x, δ_y, then the convolution trick simplifies to the kernel trick. See Proposition 4.2 for a description. While a general perspective is obtained from the convolution semigroup (I(R^d), ∗) of infinitely divisible distributions, the pdfs of I(R^d) are not tractable in general.

3. A probability measure P̃ ∈ M1(R^d) is called a dual of P ∈ M1(R^d) if P̃(B) = P(−B) for every B ∈ B(R^d), where −B := {−x : x ∈ B} (Sato, 1999, p. 8).
4. Here, the term conjugacy is used by analogy with the conjugate prior in Bayesian inference, where the prior and posterior have the same pdf form within a probabilistic model.

We then explore smaller convolution sub-semigroups (P_Θ, ∗) ⊂ (I(R^d), ∗) having a small number of parameters. In particular, we focus on the well-known α-stable distributions S_α(R^d) for each α ∈ (0, 2] and the generalized hyperbolic (GH) distributions GH(R^d), which include the Laplace, Cauchy, and Student's t distributions. For each α ∈ (0, 2], the class S_α(R^d) is closed under convolution. The GH class has various convolutional properties, as given in Proposition 4.5. As in the Gaussian case, the computation of (iii) and (iv) is realized by the evaluation of pdfs, i.e., the evaluation of conjugate kernels, after a parameter update. Unfortunately, these conjugate kernels are not tractable in general; however, we can find some subclasses of tractable conjugate kernels. See Section 6 for a discussion of the computation of conjugate kernels. Note that the α-stable and GH distribution classes have many applications: applications of S_α(R^d) are listed in Nolan (2013a), and the GH distributions have been applied, e.g., to mathematical finance with Lévy processes (Schoutens, 2003; Cont and Tankov, 2004; Barndorff-Nielsen and Halgreen, 1990; Madan et al., 1998; Barndorff-Nielsen, 1998; Barndorff-Nielsen and Prause, 2001; Carr et al., 2002). Note also that the Matérn kernel (Rasmussen and Williams, 2006, Section 4.2.1), often used in machine learning, is included in the GH class.

The rest of this paper is organized as follows. In Section 2, we review the notions of kernel means, characteristic kernels, and related matters. In Section 3, we show that the CID kernels are characteristic p.d. kernels on R^d; in addition, we present the closure properties with respect to addition, pointwise product, and convolution.
In Section 4, we introduce absorbing kernels, conjugate kernels, and the convolution trick for convolution semigroups of infinitely divisible distributions. Section 5 lists some motivating examples of kernel machine algorithms involving kernel means and parametric models. Section 6 discusses the computation of the pdfs of conjugate kernels needed to realize the convolution trick.

2. Preliminaries: Kernel Means and Characteristic Kernels

In this section, we review kernel means and characteristic kernels, restricted to R^d. Let P_d be the set of d × d p.d. matrices. Let ||x||_Σ = √(xᵀΣx) for x ∈ R^d and Σ ∈ P_d. Let L1(R^d) be the space of absolutely integrable functions on R^d, and C_b(R^d) the space of continuous and bounded functions on R^d. A symmetric function k : R^d × R^d → R is called a p.d. kernel on R^d if, for any n ∈ N and x_1, ..., x_n ∈ R^d, the matrix G_{ij} = k(x_i, x_j), i, j ∈ {1, ..., n}, is positive semidefinite. Throughout this paper, we assume that p.d. kernels are defined on R^d. It is known (Aronszajn, 1950) that every p.d. kernel k has a unique RKHS H, which is a Hilbert space of functions f : R^d → R satisfying the following: (i) k(·, x) ∈ H for all x ∈ R^d; (ii) Span{k(·, x) | x ∈ R^d} is dense in H; and (iii) the reproducing property holds: f(x) = ⟨f, k(·, x)⟩_H for all f ∈ H, x ∈ R^d, where ⟨·, ·⟩_H denotes the inner product of H. The map Φ : R^d → H; x ↦ k(·, x) is called the feature map. A p.d. kernel k is called bounded if sup_{x∈R^d} k(x, x) < ∞. A p.d. kernel k is bounded if and only if every f ∈ H is bounded (Steinwart and Christmann, 2008, Lemma 4.23). A p.d. kernel k is called separately continuous if k(·, x) : R^d → R is continuous for every x ∈ R^d. A p.d. kernel k is bounded and separately continuous if and only if every f ∈ H is a bounded and continuous function, i.e., H ⊂ C_b(R^d) (Steinwart and Christmann, 2008, Lemma 4.28). A p.d. kernel k is called continuous if k is separately continuous and x ↦ k(x, x), x ∈ R^d, is continuous (Steinwart and Christmann, 2008, Lemma 4.29). If a p.d.
kernel k is continuous, the RKHS H is separable (Steinwart and Christmann, 2008, Lemma 4.33). A p.d. kernel k is called shift-invariant if there exists a function κ : R^d → R such that k(x, y) = κ(x − y), x, y ∈ R^d; the function κ is called a p.d. function. A p.d. function κ on R^d is characterized by the Bochner theorem:

Theorem 2.1 (Bochner, 1959) (Wendland, 2005, Theorem 6.6) A continuous function κ : R^d → C is positive definite if and only if it is the Fourier transform F(Λ) of a finite nonnegative Borel measure Λ on R^d:

κ(x) = ∫_{R^d} e^{√−1 wᵀx} dΛ(w), x ∈ R^d.

Let K_cb(R^d) ⊂ C_b(R^d) denote the set of continuous and bounded p.d. functions. A p.d. kernel k is called radial if there exists a function κ : [0, ∞) → R such that k(x, y) = κ(||x − y||), x, y ∈ R^d. A radial kernel k is given by

k(x, y) = κ(||x − y||) = ∫_{[0,∞)} e^{−t||x−y||²} dν(t), x, y ∈ R^d, (2)

where ν is a finite nonnegative Borel measure on the Borel sets B([0, ∞)). A p.d. kernel k is called elliptical if k(x, y) = κ(||x − y||_Σ), x, y ∈ R^d, for some Σ ∈ P_d. Let M1(R^d) be the set of Borel probability measures on (R^d, B(R^d)). An RKHS element m_P ∈ H with a p.d. kernel k is called the kernel mean of a probability measure P ∈ M1(R^d) if it is the expectation of the feature map:

m_P := E_{X∼P}[Φ(X)] = E_{X∼P}[k(·, X)] ∈ H, P ∈ M1(R^d).

If k is a bounded and continuous p.d. kernel, then the feature map Φ : R^d → H is Bochner P-integrable for every P ∈ M1(R^d), since E_{X∼P}[||k(·, X)||_H] = E_{X∼P}[√(k(X, X))] < ∞ for all P ∈ M1(R^d) (Steinwart and Christmann, 2008, p. 510). Throughout this paper, we assume a bounded and continuous p.d. kernel k. For a set of probability measures 𝒫 ⊂ M1(R^d), we write m_𝒫 := {m_P | P ∈ 𝒫}. As mentioned in the Introduction, there are many applications using m_P, since m_P enables us to introduce a similarity and a distance between probability measures P, Q ∈ M1(R^d), via the Hilbert space inner product ⟨m_P, m_Q⟩_H and the norm ||m_P − m_Q||_H, respectively, where the reproducing property is also exploited.
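For empirical measures, the distance ||m_P − m_Q||_H is computable from pairwise kernel evaluations alone; expanding the squared norm via the reproducing property gives the (biased) maximum mean discrepancy estimator used in the hypothesis-testing literature cited above. A small sketch, assuming a Gaussian kernel (the function name is ours):

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of ||m_P - m_Q||_H^2 from samples X ~ P and Y ~ Q:
    mean k(X, X') + mean k(Y, Y') - 2 * mean k(X, Y)."""
    def gram(A, B):
        d2 = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-np.maximum(d2, 0.0) / (2.0 * gamma**2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()
```

Since the estimator is itself a squared RKHS norm of the difference of empirical kernel means, it is always nonnegative and vanishes when the two samples coincide.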
In these applications, the characteristic property is important for distinguishing any probability measures P, Q ∈ M1(R^d) by their kernel means m_P, m_Q ∈ H. The following is the definition restricted to R^d:

Definition 2.2 (Fukumizu et al., 2004) (Sriperumbudur et al., 2010, Definition 6) A bounded and continuous p.d. kernel k : R^d × R^d → R is called characteristic on R^d if the kernel mean map M1(R^d) → H; P ↦ m_P is injective, i.e., m_P = m_Q implies P = Q for any P, Q ∈ M1(R^d).

Sriperumbudur et al. (2010) showed a necessary and sufficient condition for a shift-invariant p.d. kernel k(x, y) = κ(x − y), x, y ∈ R^d, κ ∈ K_cb(R^d), to be characteristic via the Bochner theorem:

Theorem 2.3 (Sriperumbudur et al., 2010, Theorem 9) A shift-invariant p.d. kernel k with κ ∈ K_cb(R^d) is characteristic if and only if the finite nonnegative measure Λ in Theorem 2.1 has full support, supp(Λ) = R^d.

Let K^ch_cb(R^d) ⊂ K_cb(R^d) denote the set of such characteristic p.d. functions on R^d. The convolution f ∗ g of two functions f and g is defined by (f ∗ g)(x) := ∫_{R^d} f(x − y)g(y) dy. The convolution f ∗ Q of a function f and a probability measure Q ∈ M1(R^d) is defined by (f ∗ Q)(x) := ∫_{R^d} f(x − y) dQ(y). The convolution P ∗ Q of two probability measures P, Q ∈ M1(R^d) is the probability measure defined by (P ∗ Q)(B) := ∫_{R^d} P(B − x) dQ(x), where B − x := {z − x : z ∈ B}, B ∈ B(R^d). Given a function f(x), x ∈ R^d, the function f̃ is defined by f̃(x) = f(−x), x ∈ R^d. Given a probability measure P ∈ M1(R^d), the probability measure P̃ ∈ M1(R^d) is called its dual if P̃(B) = P(−B), B ∈ B(R^d), where −B := {−x : x ∈ B} (Sato, 1999, p. 8). A probability measure P is symmetric if P = P̃. We have the following simple equalities:

Proposition 2.4 (f ∗ g)~ = f̃ ∗ g̃, (f ∗ P)~ = f̃ ∗ P̃, and (P ∗ Q)~ = P̃ ∗ Q̃.

The kernel mean m_P and the RKHS inner product ⟨m_P, m_Q⟩_H have the following convolution representations:

Lemma 2.5 Let k be a shift-invariant p.d. kernel with κ ∈ C_b(R^d). Then, we have the following:

1. The kernel mean m_P is given by the convolution m_P = κ ∗ P ∈ H ⊂ C_b(R^d), P ∈ M1(R^d).
2.
The RKHS inner product ⟨m_P, m_Q⟩_H is given by the convolution ⟨m_P, m_Q⟩_H = (κ ∗ P̃ ∗ Q)(0) = (κ ∗ P ∗ Q̃)(0), P, Q ∈ M1(R^d), where P̃ and Q̃ are the duals of P and Q, respectively.

Proof 1. The kernel mean m_P has the following convolution representation:

m_P(·) = ∫_{R^d} k(x, ·) dP(x) = ∫_{R^d} κ(· − x) dP(x) = (κ ∗ P)(·), P ∈ M1(R^d).

The kernel mean m_P ∈ H ⊂ C_b(R^d) exists for all P ∈ M1(R^d) because, for κ ∈ C_b(R^d), the feature map Φ : x ↦ k(x, ·) is Bochner P-integrable for all P ∈ M1(R^d), as given in the definition of m_P.

2. The RKHS inner product ⟨m_P, m_Q⟩_H has the following convolution representation:

⟨m_P, m_Q⟩_H = ∫_{R^d} m_P(y) dQ(y) = ∫_{R^d} m̃_P(−y) dQ(y) = (m̃_P ∗ Q)(0) = (κ ∗ P̃ ∗ Q)(0),

where we have used Proposition 2.4 and κ̃ = κ in the last equality. Since ⟨m_P, m_Q⟩_H is symmetric with respect to P and Q, we also have (κ ∗ P̃ ∗ Q)(0) = (κ ∗ P ∗ Q̃)(0); this is likewise obtained from (κ ∗ P̃ ∗ Q)(0) = (κ̃ ∗ P ∗ Q̃)(0) = (κ ∗ P ∗ Q̃)(0).

In this paper, we simply take κ to be the pdf of a probability distribution.^5 Lemma 2.5 then motivates us to explore sets of probability distributions P_Θ ⊂ M1(R^d) that are closed under convolution, i.e., convolution semigroups (P_Θ, ∗).

3. Characteristic Kernels and Infinitely Divisible Distributions

In this section, we introduce CID kernels, which are defined by infinitely divisible distributions, and show that they are characteristic (Section 3.1). In addition, we examine some closure properties of CID kernels with respect to addition, pointwise product, and convolution (Section 3.2).

3.1 Convolutionally Infinitely Divisible Kernels

We review the infinite divisibility of a probability measure (Sato, 1999; F. W. Steutel, 2004; Applebaum, 2009).

Definition 3.1 (Sato, 1999, Definition 7.1, p. 31) A probability measure P ∈ M1(R^d) is called infinitely divisible if, for any integer n ∈ N, there exists a probability measure P_n ∈ M1(R^d) such that P = P_n^{∗n}.
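Definition 3.1 is conveniently illustrated through characteristic functions, since P = P_n^{∗n} is equivalent to φ_P = (φ_{P_n})^n. The following small numerical sketch (our own illustration, not from the paper) checks this for the Gaussian and Cauchy laws:

```python
import numpy as np

w = np.linspace(-5.0, 5.0, 101)
n = 7

# Gaussian N(0, 1): characteristic function exp(-w^2 / 2); its n-th convolution
# root is N(0, 1/n), so phi_P = (phi_{P_n})^n holds exactly.
phi_gauss = np.exp(-0.5 * w**2)
phi_gauss_root = np.exp(-0.5 * (1.0 / n) * w**2)

# Cauchy with scale 1: characteristic function exp(-|w|); its n-th convolution
# root is a Cauchy with scale 1/n.
phi_cauchy = np.exp(-np.abs(w))
phi_cauchy_root = np.exp(-np.abs(w) / n)

# By contrast, Uniform[-1, 1] has characteristic function sin(w)/w, which
# vanishes at w = pi; an infinitely divisible characteristic function never
# vanishes, so the uniform distribution is not infinitely divisible.
```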
The support of every infinitely divisible distribution P is unbounded, except for the delta measures {δ_x(·) | x ∈ R^d} (Sato, 1999, Example 7.2, p. 31). Let I(R^d) denote the set of infinitely divisible distributions on R^d; I(R^d) is closed under convolution. Every infinitely divisible distribution P ∈ I(R^d) has the following unique Lévy–Khintchine representation of its characteristic function. Let x ∧ y = min{x, y}, x, y ∈ R, and let 1_B denote the indicator function of B ⊂ R^d.

Theorem 3.2 (Sato, 1999, Theorem 8.1, p. 37) The characteristic function P̂(w) of an infinitely divisible distribution P ∈ I(R^d) has the unique representation

P̂(w) = exp( iwᵀγ − (1/2) wᵀAw + ∫_{R^d} ( e^{iwᵀx} − 1 − iwᵀx 1_{{|x|≤1}}(x) ) ν(dx) ), w ∈ R^d, (3)

where γ ∈ R^d, A ∈ R^{d×d} is a symmetric nonnegative-definite matrix, and ν is a measure on R^d satisfying ν({0}) = 0 and

∫_{R^d} (|x|² ∧ 1) ν(dx) < ∞. (4)

Conversely, for any γ ∈ R^d, symmetric nonnegative-definite matrix A ∈ R^{d×d}, and measure ν satisfying (4), there exists an infinitely divisible distribution P ∈ I(R^d). (A, ν, γ) is called the generating triplet of P ∈ I(R^d); A is called the covariance matrix of the Gaussian factor of P, and ν is called the Lévy measure of P. Gaussians correspond to the generating triplet (A, 0, γ). α-Stable distributions, including Cauchy distributions, correspond to generating triplets (0, ν, γ), where ν is the corresponding nonzero Lévy measure.

5. In machine learning, normalized kernels k(x, y) := k(x, y)/√(k(x, x)k(y, y)) are often used (e.g., the Gaussian kernel k(x, y) := exp(−||x − y||²/(2γ²))) (Steinwart and Christmann, 2008, Lemma 4.55). However, we consider here pdf kernels (e.g., the Gaussian kernel k(x, y) := (2πγ²)^{−d/2} exp(−||x − y||²/(2γ²))), for the closedness of the pdfs of P and m_P. A scalar multiplication k̄ := ck (c > 0) changes the quantities as follows: m̄_P := E_{X∼P}[k̄(·, X)] = c E_{X∼P}[k(·, X)] = c m_P and ⟨m̄_P, m̄_Q⟩_H̄ = c⟨m_P, m_Q⟩_H, where ⟨f, g⟩_H̄ = (1/c)⟨f, g⟩_H, f, g ∈ H = H̄ (Berlinet and Thomas-Agnan, 2004, p. 37).
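As a concrete instance of the representation (3), the following sketch (our choice for illustration) builds the characteristic function of a symmetric compound Poisson law with generating triplet (0, ν, 0), ν = λδ₁ + λδ₋₁, and checks that it is real, strictly positive, and infinitely divisible; since ν is symmetric, the compensating term −iwᵀx 1 in (3) cancels:

```python
import numpy as np

# Generating triplet (A, nu, gamma) = (0, lam * delta_{+1} + lam * delta_{-1}, 0):
# a symmetric compound Poisson law. For this nu, the integral in the
# Lévy–Khintchine formula (3) evaluates to
#   lam * (e^{iw} - 1) + lam * (e^{-iw} - 1) = 2 * lam * (cos(w) - 1),
# so phi(w) = exp(2 * lam * (cos(w) - 1)) is real and strictly positive on R.
lam = 1.0
w = np.linspace(-50.0, 50.0, 2001)
phi = np.exp(2.0 * lam * (np.cos(w) - 1.0))
```

Strict positivity of the characteristic function is exactly the property exploited for symmetric infinitely divisible laws in Section 3.1.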
The Lévy measure of the α-stable distributions is shown in Appendix A. An infinitely divisible distribution P ∈ I(R^d) is symmetric if and only if (A, ν, γ) = (A, ν_s, 0), where ν_s is a symmetric Lévy measure^6 (Sato, 1999, p. 114). Let I_S(R^d) denote the set of symmetric infinitely divisible distributions on R^d; I_S(R^d) is closed under convolution. Let K^id_cb(R^d) (⊂ C_b(R^d) ∩ L1(R^d)) denote the set of continuous and bounded pdfs^7 of symmetric infinitely divisible distributions in I_S(R^d):

K^id_cb(R^d) := {Ξ(P_s) ∈ C_b(R^d) | P_s ∈ I_S(R^d)},

where Ξ : M1(R^d) → L1(R^d) is the map that sends a probability measure P to its pdf f, if it exists. An infinitely divisible pdf κ ∈ K^id_cb(R^d) yields a characteristic kernel as follows.

Theorem 3.3 The function k(x, y) = κ(x − y), x, y ∈ R^d, κ ∈ K^id_cb(R^d), is a p.d. and characteristic kernel, i.e., K^id_cb(R^d) ⊂ K^ch_cb(R^d).

Proof A probability measure P on R^d is symmetric if and only if its characteristic function P̂(w), w ∈ R^d, is real valued (Sato, 1999, p. 67). If P is symmetric and infinitely divisible, then P̂(w) > 0 for every w ∈ R^d by the Lévy–Khintchine formula (3). Since P̂(w) is positive and has full support, supp(P̂) = R^d, k is a p.d. and characteristic kernel by Theorem 2.3.

We call a p.d. kernel k in Theorem 3.3 a convolutionally infinitely divisible (CID) kernel.^8 CID kernels include the following examples:

Example 3.4 (CID p.d. kernels) CID kernels include Gaussian kernels, Laplace kernels, Cauchy kernels, α-stable kernels for each α ∈ (0, 2] (α = 2 corresponds to Gaussian kernels; α = 1 corresponds to Cauchy kernels), sub-Gaussian α-stable kernels, Student's t kernels (Grosswald, 1976), GH kernels, normal inverse Gaussian (NIG) kernels, variance gamma (VG) kernels (the Matérn kernel is a special case of these), tempered α-stable (TαS) kernels (Rachev et al., 2011; Rosiński, 2007; Bianchi et al., 2010), etc.

6. A symmetric Lévy measure is a Lévy measure such that ν_s(B) = ν_s(−B) for B ∈ B(R^d).
7. A necessary and sufficient condition for P ∈ I_S(R^d) to have a pdf is not known (Sato, 1999, p. 177). If the Gaussian factor A ∈ R^{d×d} has full rank, then P ∈ I(R^d) has a pdf. If A = 0, see (Sato, 1999, Theorems 27.7 and 27.10) for some sufficient conditions. Every nondegenerate self-decomposable distribution on R^d has a pdf (Sato, 1999, Theorem 27.13).
8. The term "infinite divisibility" of a p.d. kernel is usually used in the pointwise product sense (Berg et al., 1984, Definition 2.6, p. 76): a p.d. kernel k : X × X → C on a nonempty set X is called infinitely divisible if, for every n ∈ N, there exists a p.d. kernel k_n : X × X → C such that k = (k_n)^n. The CID kernels considered here are infinitely divisible in the convolution sense, κ = (κ_n)^{∗n}.

3.2 Closure Property

In this subsection, we note some closure properties of CID and characteristic kernels with respect to addition, pointwise product, and convolution. The closure properties can be used, e.g., to generate new CID and characteristic kernels; Example 3.8 shows such an example. It is known that the set of continuous and bounded p.d. kernels K_cb(R^d) is closed under addition and pointwise product (Steinwart and Christmann, 2008, p. 114):

Proposition 3.5 If κ_1, κ_2 ∈ K_cb(R^d), then κ_1 + κ_2 ∈ K_cb(R^d) and κ_1κ_2 ∈ K_cb(R^d).

Similarly, the set of characteristic kernels K^ch_cb(R^d) is closed under addition and pointwise product as follows (Sriperumbudur et al., 2010, Corollary 11):

Proposition 3.6 If κ ∈ K^ch_cb(R^d), κ_1, κ_2 ∈ K_cb(R^d), and κ_2 ≠ 0, then κ + κ_1, κκ_2 ∈ K^ch_cb(R^d).

The set of CID kernels K^id_cb(R^d) is closed under convolution but not under addition or pointwise product.

Proposition 3.7 Let κ_1, κ_2 ∈ K^id_cb(R^d). Then, we have the following:

1. Convolution: κ_1 ∗ κ_2 ∈ K^id_cb(R^d).
2. The addition κ_1 + κ_2 and the product κ_1κ_2 do not necessarily belong to K^id_cb(R^d), although they are characteristic: κ_1 + κ_2, κ_1κ_2 ∈ K^ch_cb(R^d).

Proof 1. Let κ_1 = Ξ(P_1) and κ_2 = Ξ(P_2). Then κ_1 ∗ κ_2 = Ξ(P_1 ∗ P_2).
If P_1, P_2 ∈ I_S(R^d) are absolutely continuous symmetric infinitely divisible measures, then so is P_1 ∗ P_2 ∈ I_S(R^d).

2. A mixture of two infinitely divisible distributions is not necessarily infinitely divisible, and a product of two infinitely divisible pdfs is not necessarily infinitely divisible. Counterexamples are as follows. Let κ_1(x) = e^{−|x|} and κ_2(x) = e^{−x²}, x ∈ R, be p.d. functions of Laplace and Gaussian kernels, respectively. Then the product κ(x) ∝ e^{−|x|}e^{−x²} is not infinitely divisible (F. W. Steutel, 2004, Example 11.13), although it is characteristic (Proposition 3.6). Let κ_1(x) = (4π)^{−1/2} e^{−x²/4} and κ_2(x) = (8π)^{−1/2} e^{−x²/8}, x ∈ R, be Gaussian kernels; then the addition κ_1 + κ_2 is not infinitely divisible (F. W. Steutel, 2004, Example 11.15), although it is characteristic (Proposition 3.6).

Many more examples can be found in F. W. Steutel (2004). As noted in Proposition 3.7, infinite divisibility is not closed under mixing in general, although some special mixtures preserve it (F. W. Steutel, 2004, Chapter 7); the normal mean-variance mixture with an infinitely divisible mixing distribution, given in Lemma 4.4, is one of them. New CID kernels and characteristic kernels may be generated by using these closure properties. If κ = F(κ̂) is an infinitely divisible pdf with characteristic function κ̂, then the symmetrization κ^◇ := κ ∗ κ̃ = F(|κ̂|²) and the positive powers (κ^◇)^{∗λ} = F(|κ̂|^{2λ}) (λ > 0) are also infinitely divisible pdfs. The following example shows that the Laplace and symmetric Gamma kernels are CID kernels generated from an exponential distribution.

Example 3.8 (F. W. Steutel, 2004, Example 2.9) The exponential distribution P with pdf κ(x) = α exp(−αx)1_{[0,∞)}(x), α > 0, is infinitely divisible. Its dual has pdf κ̃(x) = α exp(αx)1_{(−∞,0)}(x).

1. The symmetrization κ^◇ = κ ∗ κ̃ has the characteristic function κ̂^◇(w) = |κ̂(w)|² = (α/(α − iw)) (α/(α + iw)) = α²/(α² + w²). This is the Laplace pdf κ^◇(x) = (α/2) exp(−α|x|).
2.
The positive powers (κ^◇)^{∗λ} (λ > 0) have characteristic functions (κ̂^◇(w))^λ = (α²/(α² + w²))^λ. If λ = 1, the pdf is the Laplace case above. If λ = 2, the pdf is given by (κ^◇)^{∗2}(x) = (α/4)(1 + α|x|) exp(−α|x|). For general λ > 0, the pdf is given by

f(x) = ( α^{2λ} / (√π (2α)^{λ−1/2} Γ(λ)) ) |x − µ|^{λ−1/2} K_{λ−1/2}(α|x − µ|), x ∈ R,

where Γ(λ) is the Gamma function and K_λ(x) is the modified Bessel function of the third kind with index λ. This is the pdf of the zero-skewed VG distribution VG_1(λ, α, β = 0, µ, 1) on R, as given in Section 4.3.

The additions (κ^◇)^{∗λ} + κ, κ ∈ K_cb(R^d), and the products (κ^◇)^{∗λ} κ, κ ∈ K^ch_cb(R^d), are characteristic kernels by the closure properties.

4. Kernel Means and Infinitely Divisible Distributions

In this section, we examine the kernel means of parametric classes of distributions P_Θ ⊂ I(R^d). As mentioned in the Introduction, we wish to compute (iii) kernel mean values m_P(x), x ∈ R^d, and (iv) RKHS inner products ⟨m_P, m_Q⟩_H for parametric models P, Q ∈ P_Θ. These form the basic computation for establishing kernel machine algorithms that combine kernel means and parametric models. In Section 4.1, we introduce absorbing kernels, conjugate kernels, and the convolution trick in the set of infinitely divisible distributions I(R^d). In Sections 4.2 and 4.3, we focus on the well-known subclasses of α-stable distributions and GH distributions, which include the Laplace, Cauchy, and Student's t distributions.

4.1 Absorbing, Conjugate Kernels, and Convolution Trick

We begin by introducing the notion of p.d. kernels absorbing to, or conjugate to, particular sets of parametric models P_Θ:

Proposition 4.1 (absorbing & conjugate kernel) Let P_Θ, Q_Θ' ⊂ M1(R^d) be two sets of parametric models such that P_Θ ∗ Q_Θ' ⊂ P_Θ, where Θ and Θ' are finite or infinite index sets. Denote by Ξ(P_Θ) and Ξ(Q_Θ') the corresponding sets of pdfs. Let κ ∈ K_cb(R^d) be a shift-invariant p.d. kernel. We have the following statements:

1. If κ ∈ Ξ(P_Θ), then m_{Q_Θ'} ⊂ Ξ(P_Θ) holds. The RKHS inner products ⟨m_P, m_Q⟩_H, P, Q ∈ Q_Θ', are values of pdfs in Ξ(P_Θ).
2.
If κ ∈ Ξ(Q_Θ'), then m_{P_Θ} ⊂ Ξ(P_Θ) holds. The RKHS inner products ⟨m_P, m_Q⟩_H, P, Q ∈ P_Θ, are not necessarily values of pdfs in Ξ(P_Θ).

Proof These statements follow directly from Lemma 2.5 and the assumptions.

Statements 1 and 2 indicate an absorbing property of k with respect to the parametric models. If P_Θ = Q_Θ' in Proposition 4.1, we call k (and, hence, its RKHS H) conjugate to P_Θ. A general perspective is given by the CID kernels, which are conjugate to I(R^d) as follows:

Proposition 4.2 Let k_{A,ν_s}(x, y) = κ_{A,ν_s}(x − y), x, y ∈ R^d, be a CID kernel, where κ_{A,ν_s} ∈ K^id_cb(R^d) has generating triplet (A, ν_s, 0), and let H_{A,ν_s} be the RKHS given by κ_{A,ν_s}. Let P, Q ∈ I(R^d) be infinitely divisible distributions with generating triplets (A_P, ν_P, γ_P) and (A_Q, ν_Q, γ_Q), respectively. Then, we have the following:

1. The kernel mean m_P is given by an infinitely divisible pdf:

m_P(·) = f(· ; A + A_P, ν_s + ν_P, γ_P) = k_{A+A_P, ν_s+ν_P}(γ_P, ·), with f ∈ Ξ(I(R^d)).

2. The RKHS inner product ⟨m_P, m_Q⟩_{H_{A,ν_s}} is given by

⟨m_P, m_Q⟩_{H_{A,ν_s}} = f(0 ; A + A_P + A_Q, ν_s + ν̃_P + ν_Q, γ_Q − γ_P)
= f(0 ; A + A_P + A_Q, ν_s + ν_P + ν̃_Q, γ_P − γ_Q)
= k_{A+A_P+A_Q, ν_s+ν_P+ν̃_Q}(γ_P, γ_Q),

where ν̃_P (respectively, ν̃_Q) is the dual of the Lévy measure ν_P (respectively, ν_Q).

Proposition 4.2 indicates a general convolution trick. The computation of ⟨m_P, m_Q⟩_{H_{A,ν_s}} reduces to the evaluation of the same type of kernel, k_{A+A_P+A_Q, ν_s+ν_P+ν̃_Q}, with the updated parameters of the generating triplets. If Q is a delta measure δ_y (i.e., A_Q = 0, ν_Q = 0, γ_Q = y), then statement 2 specializes to statement 1. If P, Q are both delta measures δ_x, δ_y (i.e., A_P = A_Q = 0, ν_P = ν_Q = 0, γ_P = x, γ_Q = y), then statement 2 specializes to the kernel trick ⟨k_{A,ν_s}(·, x), k_{A,ν_s}(·, y)⟩_{H_{A,ν_s}} = k_{A,ν_s}(x, y).
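Proposition 4.2 is easy to exercise numerically in the fully Gaussian case (ν_s = ν_P = ν_Q = 0), where the parameter update is just variance addition and the delta-measure specialization recovers the kernel trick. A minimal sketch (NumPy; function names are ours, and the isotropic Gaussian-pdf kernel is our choice):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Isotropic d-dimensional Gaussian pdf f_d(x | mu, var * I_d); batched in x."""
    x = np.atleast_2d(x)
    q = np.sum((x - mu) ** 2, axis=1)
    return np.exp(-q / (2.0 * var)) / (2.0 * np.pi * var) ** (x.shape[1] / 2.0)

def conv_trick_inner(mu_p, var_p, mu_q, var_q, tau2):
    """<m_P, m_Q>_H for Gaussian P, Q and a Gaussian-pdf kernel of variance tau2:
    the same kernel form evaluated at (mu_P, mu_Q), variance tau2 + var_P + var_Q."""
    return gauss_pdf(mu_p, mu_q, tau2 + var_p + var_q)[0]

d, tau2 = 2, 0.5
mu_p, var_p = np.array([0.5, -0.3]), 0.7
mu_q, var_q = np.array([-0.4, 0.2]), 0.6
inner = conv_trick_inner(mu_p, var_p, mu_q, var_q, tau2)
# Delta-measure limit: var_P = var_Q = 0 recovers <k(., x), k(., y)>_H = k(x, y).
kernel_trick = conv_trick_inner(mu_p, 0.0, mu_q, 0.0, tau2)
```

A Monte Carlo average of k(X, Y) over X ∼ P, Y ∼ Q converges to the same closed-form value, which is the point of the parameter-update rule.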
If P, Q, and k are all Gaussian (i.e., ν_P = ν_Q = ν_s = 0), then statement 2 results in the computation of the same Gaussian kernel with increased variance A + A_P + A_Q, for which the computation of Gaussian pdfs is tractable. Although Proposition 4.2 provides the theory that kernel means m_P and RKHS inner products ⟨m_P, m_Q⟩ are expressed in terms of generating triplets (A, ν, γ), general infinitely divisible pdfs may be intractable to compute. We therefore systematically examine smaller sub-semigroups of parametric models (P_Θ, ∗) ⊂ (I(R^d), ∗) for which the computation of pdfs may be possible. We specifically examine the well-known parametric classes of α-stable distributions and GH distributions on R^d in Sections 4.2 and 4.3, respectively.

4.2 α-Stable Distributions

The α-stable distributions S_α(R^d), α ∈ (0, 2], on R^d are a well-known convolution sub-semigroup of the infinitely divisible distributions (Zolotarev, 1986; Samorodnitsky and Taqqu, 1994). α = 2 gives the Gaussian distributions S_2(R^d) = G(R^d), which are closed under convolution: if P and Q are N(µ_P, R_P) and N(µ_Q, R_Q) with mean vectors µ_P, µ_Q and covariance matrices R_P, R_Q, respectively, then the convolution P ∗ Q is N(µ_P + µ_Q, R_P + R_Q). For α ∈ (0, 2), α-stable distributions are heavy tailed and have many applications, as listed in Nolan (2013a). For each α ∈ (0, 2), a one-dimensional α-stable distribution S_α(σ, β, µ) is specified by a scale parameter σ > 0, a skewness parameter β ∈ [−1, 1], and a location parameter µ ∈ R. For each α ∈ (0, 2), the set S_α(R) is closed under convolution: if P and Q are two stable laws S_α(σ_P, β_P, µ_P) and S_α(σ_Q, β_Q, µ_Q), respectively, then P ∗ Q is

S_α(σ, β, µ) = S_α( (σ_P^α + σ_Q^α)^{1/α}, (β_P σ_P^α + β_Q σ_Q^α)/(σ_P^α + σ_Q^α), µ_P + µ_Q )

(Samorodnitsky and Taqqu, 1994, Property 1.2.1). See Appendix A.2 for more details.
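At α = 1 with β = 0 (the Cauchy case), the convolution rule above reduces to simple scale addition, (σ_P¹ + σ_Q¹)^{1/1} = σ_P + σ_Q, and the resulting kernel-mean parameter update is fully tractable. A numerical sketch (names are ours) that checks the update m_P(x) = f_1(x | σ_k + σ_P, 0, µ_P) for a Cauchy-pdf kernel of scale σ_k by Monte Carlo:

```python
import numpy as np

def cauchy_pdf(x, mu, scale):
    """pdf of the one-dimensional Cauchy law S_1(scale, 0, mu)."""
    return scale / (np.pi * (scale**2 + (x - mu) ** 2))

# Convolution rule at alpha = 1, beta = 0: scales simply add,
# S_1(s1, 0, m1) * S_1(s2, 0, m2) = S_1(s1 + s2, 0, m1 + m2).
sig_k, sig_p, mu_p = 0.8, 0.5, 0.4
rng = np.random.default_rng(3)
X = mu_p + sig_p * rng.standard_cauchy(500000)   # samples from P = S_1(sig_p, 0, mu_p)
x = 0.1
mc = cauchy_pdf(x - X, 0.0, sig_k).mean()        # Monte Carlo m_P(x) = E_P[k(x, X)]
cf = cauchy_pdf(x, mu_p, sig_k + sig_p)          # closed form via the scale update
```

The Monte Carlo average converges despite the heavy tails, because the integrand k(x, ·) is bounded.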
For each α ∈ (0, 2), a d-dimensional α-stable distribution S_α(µ, Γ) is specified by a location parameter µ ∈ R^d and a spectral measure Γ on the unit sphere S^{d−1} := {s ∈ R^d : ‖s‖ = 1} (Samorodnitsky and Taqqu, 1994, Theorem 2.3.1, p. 65). For each α ∈ (0, 2), the set S_α(R^d) is closed under convolution: if P and Q are two stable laws S_α(µ_P, Γ_P) and S_α(µ_Q, Γ_Q), respectively, then P ∗ Q is S_α(µ_P + µ_Q, Γ_P + Γ_Q). See Appendix A.1 for more details. α-Stable pdfs on R^d are intractable in general. Sub-Gaussian α-stable distributions (equivalently, elliptically contoured α-stable distributions) SG_α(R^d) are a well-known subclass of S_α(R^d) (Samorodnitsky and Taqqu, 1994; Nolan, 2013b). For each α ∈ (0, 2), a sub-Gaussian α-stable distribution SG_α(µ, R) is specified by a location parameter µ ∈ R^d and a p.d. matrix R ∈ R^{d×d} (Samorodnitsky and Taqqu, 1994, Theorem 2.5.2, p. 78). See Appendix A.4 for more details. Sub-Gaussian 1-stable distributions give the d-dimensional Cauchy distributions CAU(R^d) (Samorodnitsky and Taqqu, 1994, Example 2.5.3, p. 79). If d = 1, for each α ∈ (0, 2), the sub-Gaussians SG_α(R) are closed under convolution. If d > 1, for each α ∈ (0, 2), the sub-Gaussians SG_α(R^d) are not closed under convolution. Let us decompose SG_α(R^d) into equivalence classes SG_α(R^d) = ∪_R SG_α(R^d)[R], where SG_α(R^d)[R] := {P ∈ SG_α(R^d) | P = SG_α(µ, cR), µ ∈ R^d, c > 0}. For each α ∈ (0, 2) and p.d. matrix R ∈ P_d, the set SG_α(R^d)[R] is closed under convolution: if P and Q are SG_α(µ_P, c_P R) and SG_α(µ_Q, c_Q R), respectively, then P ∗ Q is SG_α(µ_P + µ_Q, (c_P^{α/2} + c_Q^{α/2})^{2/α} R). Note that when α = 2, the whole set SG_2(R^d) is closed. These convolution properties of α-stable distributions lead to the following conjugate pairs of α-stable kernels k and α-stable distributions P_Θ.

Example 4.3  Conjugate pairs of α-stable kernels k and α-stable distributions on R^d.

1. For α = 2, let k_R(x, y) = (2π)^{−d/2}|R|^{−1/2} exp(−(x − y)ᵀR^{−1}(x − y)/2) be a Gaussian kernel and H_R be its RKHS. Let P, Q be two Gaussians N(µ_P, R_P) and N(µ_Q, R_Q), respectively.
Then, the kernel mean is given by the Gaussian pdf m_P = f(·|µ_P, R + R_P) and the RKHS inner product is given by the Gaussian pdf ⟨m_P, m_Q⟩_{H_R} = f(µ_P|µ_Q, R + R_P + R_Q).

2. For each α ∈ (0, 2), let k_{α,σ}(x, y) = κ_{α,σ}(x − y), x, y ∈ R, be an α-stable kernel on R and H_{α,σ} be its RKHS. Let P, Q be two α-stable laws S_α(σ_P, β_P, µ_P) and S_α(σ_Q, β_Q, µ_Q), respectively, on R. Then, the kernel mean is given by the stable pdf m_P = f_α(·|(σ_P^α + σ^α)^{1/α}, β_P σ_P^α/(σ_P^α + σ^α), µ_P) and the RKHS inner product is given by the stable pdf ⟨m_P, m_Q⟩_{H_{α,σ}} = f_α(µ_P|(σ_P^α + σ_Q^α + σ^α)^{1/α}, (β_Q σ_Q^α − β_P σ_P^α)/(σ_Q^α + σ_P^α + σ^α), µ_Q). If α = 1 and β = 0, then S_1(σ, 0, µ) corresponds to the Cauchy distribution.

3. For each α ∈ (0, 2), let k_{α,Γ_s}(x, y) = κ_{α,Γ_s}(x − y), x, y ∈ R^d, be an α-stable kernel on R^d, where Γ_s is a symmetric spectral measure, and let H_{α,Γ_s} be its RKHS. Let P, Q be two α-stable laws S_α(µ_P, Γ_P) and S_α(µ_Q, Γ_Q), respectively, on R^d. Then, the kernel mean is given by the stable pdf m_P = f_α(·|µ_P, Γ_P + Γ_s) and the RKHS inner product is given by the stable pdf ⟨m_P, m_Q⟩_{H_{α,Γ_s}} = f_α(µ_P|µ_Q, Γ̃_P + Γ_Q + Γ_s), where Γ̃_P is the dual of Γ_P.

4. For each α ∈ (0, 2), let k_{α,R}(x, y) = κ_{α,R}(x − y), x, y ∈ R^d, be a sub-Gaussian α-stable kernel on R^d and let H_{α,R} be its RKHS. Let P, Q ∈ SG_α(R^d)[R] be two sub-Gaussian α-stable laws SG_α(µ_P, c_P R) and SG_α(µ_Q, c_Q R), respectively, on R^d. Then, the kernel mean is given by the sub-Gaussian pdf m_P = f_α(·|µ_P, (c_P^{α/2} + 1)^{2/α} R) and the RKHS inner product is given by the sub-Gaussian pdf ⟨m_P, m_Q⟩_{H_{α,R}} = f_α(µ_P|µ_Q, (c_P^{α/2} + c_Q^{α/2} + 1)^{2/α} R). If α = 1, then S_1(µ, R) corresponds to the multivariate Cauchy distribution with pdf f(x) ∝ (1 + ‖x − µ‖²_{R^{−1}})^{−(d+1)/2}.

5. Tempered stable distributions can also be considered as examples (Rachev et al., 2011, Table 3.2, p. 77).
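Item 1 can be sanity-checked by Monte Carlo, using the defining identity ⟨m_P, m_Q⟩_H = E[k(X, Y)] for X ~ P, Y ~ Q independent. A one-dimensional sketch with illustrative parameters:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

R, (mu_P, R_P), (mu_Q, R_Q) = 0.4, (0.0, 0.8), (1.0, 0.3)  # illustrative parameters

# Closed form from the convolution trick (Example 4.3, item 1, d = 1):
closed = gauss_pdf(mu_P, mu_Q, R + R_P + R_Q)

# Monte Carlo estimate of <m_P, m_Q>_H = E[k(X, Y)], X ~ P, Y ~ Q.
rng = np.random.default_rng(0)
X = rng.normal(mu_P, np.sqrt(R_P), 1_000_000)
Y = rng.normal(mu_Q, np.sqrt(R_Q), 1_000_000)
mc = np.mean(gauss_pdf(X - Y, 0.0, R))

print(closed, mc)
```

The two numbers agree up to Monte Carlo error, confirming that the inner product is a single Gaussian pdf value with the summed covariance R + R_P + R_Q.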
4.3 Generalized Hyperbolic Distributions

GH distributions on R^d are a rich model class that includes, e.g., NIG distributions, hyperbolic distributions, VG distributions, Laplace distributions, Cauchy distributions, and Student's t distributions as special and limiting cases (Barndorff-Nielsen and Halgreen, 1977; Prause, 1999; v. Hammerstein, 2010). A list of parametric models is found in, e.g., Prause (1999, Table 1.1, p. 4). The GH and related models are applied, e.g., to mathematical finance (Schoutens, 2003; Cont and Tankov, 2004; Barndorff-Nielsen and Halgreen, 1990; Madan et al., 1998; Barndorff-Nielsen, 1998; Barndorff-Nielsen and Prause, 2001; Carr et al., 2002). The Matérn kernel, often used in machine learning, is a special case of the VG distributions. A GH distribution is obtained as a normal mean-variance mixture with a generalized inverse Gaussian (GIG) mixing distribution, which is a special case of the normal mean-variance mixture of the generalized Γ-convolution (Thorin, 1978). The pdfs of the GIG, GH, NIG, and VG distributions are presented in Appendix B.

We start by introducing normal mean-variance mixture distributions. Let N_d(µ, Σ) be a Gaussian distribution with mean vector µ ∈ R^d and covariance matrix Σ ∈ P_d. A normal mean-variance mixture distribution P on R^d is given by P(dx) = ∫_{R_+} N_d(µ + yβ, yΣ)(dx) G(dy), β ∈ R^d, where G is a mixing probability measure on R_+ (v. Hammerstein, 2010, Definition 2.4, p. 78). We write P = N_d(µ + yβ, yΣ) ◦ G for short. The closure properties of convolution and the infinite divisibility of G are preserved as follows:

Lemma 4.4 (v. Hammerstein, 2010, Lemma 2.5, p. 68)  Let G be a class of probability distributions on (R_+, B_+) and G, G_1, G_2 ∈ G.

1. If G = G_1 ∗ G_2 ∈ G, then (N_d(µ_1 + yβ, yΣ) ◦ G_1) ∗ (N_d(µ_2 + yβ, yΣ) ◦ G_2) = N_d(µ_1 + µ_2 + yβ, yΣ) ◦ G.

2. If G is infinitely divisible, then so is N_d(µ + yβ, yΣ) ◦ G.
A GH distribution on R^d is given by a normal mean-variance mixture with the GIG mixing distribution: GH_d(λ, α, β, δ, µ, Σ) := N_d(µ + yΣβ, yΣ) ◦ GIG(λ, δ, √(α² − ‖β‖²_Σ)), where λ ∈ R, α > 0 is a shape parameter, β ∈ R^d is a skewness parameter, δ is a scaling parameter, µ ∈ R^d is a location parameter, and Σ ∈ P_d is a p.d. matrix (see Appendices B.1 and B.2 for more details). A univariate GH distribution on R is given by letting d = 1 and Σ = 1. The GH distribution contains the following subclasses and limiting cases; their pdfs are found in Appendices B.3 and B.4 and in v. Hammerstein (2010).

1. If λ = −1/2, then GH_d(−1/2, α, β, δ, µ, Σ) corresponds to the NIG distribution: NIG_d(α, β, δ, µ, Σ) := N_d(µ + yΣβ, yΣ) ◦ GIG(−1/2, δ, √(α² − ‖β‖²_Σ)).

2. If λ = (d+1)/2, then GH_d((d+1)/2, α, β, δ, µ, Σ) corresponds to the hyperbolic distribution HYP_d(α, β, δ, µ, Σ).

3. If λ > 0 and δ → 0, then GH_d(λ, α, β, 0, µ, Σ) corresponds to the VG distribution VG_d(λ, α, β, µ, Σ) := N_d(µ + yΣβ, yΣ) ◦ Gamma(λ, (α² − ‖β‖²_Σ)/2), where Gamma(λ, γ) is the Gamma distribution with pdf f(x) = (γ^λ/Γ(λ)) x^{λ−1} e^{−γx}. Furthermore, if λ = (d+1)/2 (i.e., the above hyperbolic case), then VG_d((d+1)/2, α, β, µ, Σ) corresponds to the skewed Laplace distribution LAP_d(α, β, µ, Σ) := N_d(µ + yΣβ, yΣ) ◦ Gamma((d+1)/2, (α² − ‖β‖²_Σ)/2), with pdf f(x) ∝ e^{−α‖x−µ‖_{Σ^{−1}} + ⟨β, x−µ⟩}. We have seen the case of d = 1 in Example 3.8.

4. If λ < 0, α → 0, and β → 0, then GH_d(λ, 0, 0, δ, µ, Σ) corresponds to the scaled and shifted t distribution with f = −2λ degrees of freedom: t_d(λ, δ, µ, Σ) := N_d(µ, yΣ) ◦ iGamma(λ, δ²/2), where iGamma(λ, δ) is the inverse Gamma distribution with pdf f(x) = x^{λ−1} e^{−δ/x} / (δ^λ Γ(−λ)), λ < 0. Furthermore, if λ = −1/2 (i.e., the above NIG case), then t_d(−1/2, δ, µ, Σ) corresponds to the multivariate Cauchy distribution CAU(δ, µ, Σ) := N_d(µ, yΣ) ◦ iGamma(−1/2, δ²/2), with pdf f(x) ∝ (1 + ‖x − µ‖²_{Σ^{−1}}/δ²)^{−(d+1)/2}, which was also shown in Example 4.3.
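The VG/Laplace case in item 3 can be verified numerically for d = 1: with λ = (d+1)/2 = 1, β = 0, and µ = 0, mixing N(0, y) over y ~ Gamma(1, α²/2) (an exponential mixing law) reproduces the symmetric Laplace pdf (α/2)e^{−α|x|}. A sketch with illustrative values:

```python
import numpy as np

alpha, x = 1.5, 1.3  # illustrative; d = 1, lambda = (d + 1)/2 = 1, beta = 0, mu = 0

# Quadrature over the mixing variable y ~ Gamma(1, alpha^2/2), i.e., rate alpha^2/2.
y, dy = np.linspace(1e-6, 60.0, 600000, retstep=True)
normal = np.exp(-x**2 / (2.0 * y)) / np.sqrt(2.0 * np.pi * y)   # N(x; 0, y)
gamma = (alpha**2 / 2.0) * np.exp(-(alpha**2 / 2.0) * y)        # Gamma(1, alpha^2/2) pdf
mixture = np.sum(normal * gamma) * dy

laplace = (alpha / 2.0) * np.exp(-alpha * abs(x))
print(mixture, laplace)
```

The quadrature value matches the Laplace pdf, illustrating how the heavy-tailed members of the GH class arise from Gaussian mixtures over the variance.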
These classes have the following convolution properties, by Lemma 4.4 and Proposition B.1, which are the multivariate extensions of the univariate case (v. Hammerstein, 2010, eq. (1.9), p. 14).

Proposition 4.5  For each d ≥ 1, the following convolution properties hold for the d-dimensional GH distributions:

1. NIG_d(α, β, δ_1, µ_1, Σ) ∗ NIG_d(α, β, δ_2, µ_2, Σ) = NIG_d(α, β, δ_1 + δ_2, µ_1 + µ_2, Σ),
2. VG_d(λ_1, α, β, µ_1, Σ) ∗ VG_d(λ_2, α, β, µ_2, Σ) = VG_d(λ_1 + λ_2, α, β, µ_1 + µ_2, Σ),
3. NIG_d(α, β, δ_1, µ_1, Σ) ∗ GH_d(1/2, α, β, δ_2, µ_2, Σ) = GH_d(1/2, α, β, δ_1 + δ_2, µ_1 + µ_2, Σ),
4. GH_d(−λ, α, β, δ, µ_1, Σ) ∗ GH_d(λ, α, β, 0, µ_2, Σ) = GH_d(λ, α, β, δ, µ_1 + µ_2, Σ),

where λ, λ_1, λ_2 > 0. These convolution properties can also be obtained by looking up the characteristic functions and Lévy measures in v. Hammerstein (2010, Section 1.6.4, p. 46; Section 2.3, p. 79). Properties 1 and 2 give convolution semigroups. Property 3 gives an absorbing property. Property 4 gives another convolution property. From Proposition 4.5, we obtain the following conjugate, absorbing, and related pairs of GH kernels and GH distributions. The parametric models in Proposition 4.5 contain p.d. kernels κ if and only if β = 0. Each example (1-4) below corresponds to the respective property (1-4) in Proposition 4.5.

Example 4.6  Conjugate, absorbing, and related pairs in the GH class.

1. Let k_{α,δ,Σ}(x, y) be a shift-invariant NIG p.d. kernel and H_{α,δ,Σ} be its RKHS. Let P, Q be two NIG distributions NIG(α, 0, δ_P, µ_P, Σ) and NIG(α, 0, δ_Q, µ_Q, Σ), respectively. Then, the kernel mean is the NIG pdf m_P = f(·|α, 0, δ_P + δ, µ_P, Σ) and the RKHS inner product is the NIG pdf ⟨m_P, m_Q⟩_{H_{α,δ,Σ}} = f(µ_P|α, 0, δ_P + δ_Q + δ, µ_Q, Σ). If α → 0, then these correspond to the Cauchy case.

2. Let k_{λ,α,Σ}(x, y) be a shift-invariant VG p.d. kernel⁹ and H_{λ,α,Σ} be its RKHS. Let P, Q be two VG distributions VG(λ_P, α, 0, µ_P, Σ) and VG(λ_Q, α, 0, µ_Q, Σ), respectively.
Then, the kernel mean is the VG pdf m_P = f(·|λ_P + λ, α, 0, µ_P, Σ) and the RKHS inner product is the VG pdf ⟨m_P, m_Q⟩_{H_{λ,α,Σ}} = f(µ_P|λ_P + λ_Q + λ, α, 0, µ_Q, Σ). If λ_P + λ = (d+1)/2 (respectively, λ_P + λ_Q + λ = (d+1)/2), then these correspond to the Laplace case.

3. Let k_{α,δ,Σ}(x, y) be a NIG kernel and H_{α,δ,Σ} be its RKHS. Let P be a GH distribution GH(1/2, α, 0, δ_P, µ_P, Σ). Then, the kernel mean is the GH pdf m_P = f(·|1/2, α, 0, δ_P + δ, µ_P, Σ). If α → 0, then the NIG kernel k_{0,δ,Σ}(x, y) corresponds to the Cauchy kernel. Let k_{1/2,α,δ,Σ}(x, y) be a GH kernel and H_{1/2,α,δ,Σ} be its RKHS. Let P, Q be two NIG distributions NIG(α, 0, δ_P, µ_P, Σ) and NIG(α, 0, δ_Q, µ_Q, Σ), respectively. Then, the kernel mean is the GH pdf m_P = f(·|1/2, α, 0, δ_P + δ, µ_P, Σ) and the RKHS inner product is the GH pdf ⟨m_P, m_Q⟩_{H_{1/2,α,δ,Σ}} = f(µ_P|1/2, α, 0, δ_P + δ_Q + δ, µ_Q, Σ). If α → 0, then the NIG distributions P and Q correspond to Cauchy distributions.

4. For λ > 0, let k_{−λ,α,δ,Σ}(x, y) be a GH kernel and H_{−λ,α,δ,Σ} be its RKHS. Let P be a GH distribution GH(λ, α, 0, 0, µ_P, Σ). Then, the kernel mean is the GH pdf m_P = f(·|λ, α, 0, δ, µ_P, Σ). If α → 0, then k_{−λ,0,δ,Σ}(x, y) corresponds to the Student's t kernel. Furthermore, if λ = 1/2, then k_{−1/2,0,δ,Σ}(x, y) corresponds to the Cauchy kernel. For λ > 0, let k_{λ,α,Σ}(x, y) be a GH kernel and H_{λ,α,Σ} be its RKHS. Let P be a GH distribution GH(−λ, α, 0, δ_P, µ_P, Σ). Then, the kernel mean is the GH pdf m_P = f(·|λ, α, 0, δ_P, µ_P, Σ). If α → 0, then P is a Student's t distribution. Furthermore, if λ = 1/2, then P is a Cauchy distribution.

9. The Matérn kernel corresponds to Σ = I and α = √(2ν)/σ (Rasmussen and Williams, 2006, Section 4.2.1; Sriperumbudur et al., 2010, p. 1533).

5.
Connection to Machine Learning

As mentioned in the Introduction, absorbing and conjugate kernels (Examples 4.3 and 4.6) provide a way to compute the RKHS values (i) f(x), x ∈ R^d, and the RKHS inner products (ii) ⟨f, g⟩_H when f, g ∈ H are expressed by weighted sums of parametric kernel means, f = ∑_{i=1}^n w_i m_{P_i} and g = ∑_{j=1}^l w'_j m_{Q_j} for {P_i}, {Q_j} ⊂ P_Θ. Many algorithms can make use of the convolution trick. Examples include the following.

The difference between a probability measure P ∈ M_1(R^d) and a model P_θ ∈ P_Θ in the RKHS norm ‖m_P − m_{P_θ}‖_H needs to be computed, e.g., for the purpose of goodness-of-fit testing and model criticism (Lloyd and Ghahramani, 2015), based on the maximum mean discrepancy (MMD) (Gretton et al., 2012).

Various kernels k(P, P_θ) between a probability measure P and a model P_θ, e.g., k(P, P_θ) = exp(−‖m_P − m_{P_θ}‖²_H / (2σ²)), need to be computed, as in the support measure machine (Muandet et al., 2012).

Song et al. (2008) and McCalman et al. (2013) studied the approximation of a target probability measure P ∈ M_1(R^d) with a Gaussian mixture model P_θ = ∑_{i=1}^n θ_i P_i by solving the optimization problem θ̂ = argmin_θ ‖m̂_P − m_{P_θ}‖²_H + Ω(θ) = argmin_θ ‖m̂_P − ∑_{i=1}^n θ_i m_{P_i}‖²_H + Ω(θ), where Ω(θ) is a regularization term, e.g., (λ/2)‖θ‖² (λ > 0). This optimization is solved by a constrained quadratic program: min_θ (1/2)θᵀ(A + λI_n)θ − bᵀθ subject to ∑_{i=1}^n θ_i = 1 and θ ≥ 0, which requires the computation of the matrix A ∈ R^{n×n} and vector b ∈ R^n with A_ij = ⟨m_{P_i}, m_{P_j}⟩_H and b_j = ⟨m̂_P, m_{P_j}⟩_H, 1 ≤ i, j ≤ n, for parametric kernel means {m_{P_i}}.

As mentioned in the Introduction, kernel Bayesian inference (KBI), which carries out Bayesian inference in kernel mean form, has been proposed (Fukumizu et al., 2013; Song et al., 2013). KBI is applied, e.g., to filtering and smoothing algorithms on state space models (Fukumizu et al., 2013; Kanagawa et al., 2016; Nishiyama et al., 2016) and policy learning in reinforcement learning (Grünewälder et al.
2012, Nishiyama et al. 2012, Rawlik et al. 2013, Boots et al. 2013). When we extend KBI to semiparametric KBI, which combines nonparametric and parametric inference, we may want to use RKHS functions f = ∑_{i=1}^n w_i m_{P_{θ_i}} ∈ H expressed by parametric kernel means {P_{θ_i}} ⊂ P_Θ, as in the model-based kernel sum rule (Mb-KSR) (Nishiyama et al., 2014).

Preimage algorithms (Mika et al., 1999; Fukumizu et al., 2013) and kernel herding algorithms (Chen et al., 2010) can also be extended to estimators f = ∑_{i=1}^n w_i m_{P_{θ_i}} with parametric kernel means {P_{θ_i}}.

6. Computation of Conjugate Kernels (Convolution Trick)

In Section 4, we showed mathematically that several convolution tricks hold as instances of a general convolution trick (Proposition 4.2): if conjugate kernels are used, the computation of kernel mean values and RKHS inner products is the same as the computation of p.d. kernels with different parameters. However, conjugate kernels do not admit tractable computation in general. We therefore discuss the computation of the conjugate kernels: α-stable kernels and GH kernels.

It is known that α-stable pdfs do not generally have a closed-form expression, except for some special cases such as the Gaussian (α = 2) and Cauchy (α = 1) cases, as given in Appendix A.3. Gaussian and Cauchy kernels may thus be used as tractable conjugate kernels. For the other α-stable kernels (α ≠ 2 and α ≠ 1), some numerical elaboration or approximation may be needed to compute the pdfs. The STABLE 5.1 software¹⁰ allows the computation of α-stable pdfs when they are independent, isotropic, elliptical, or have discrete spectral measures Γ_d under some settings; more information can be found in the STABLE 5.1 software manual. For elliptically contoured α-stable sub-Gaussian kernels on R^d of any dimension, it suffices to compute a one-dimensional amplitude function κ(r) in equation (2), which can be done by, e.g., one-dimensional numerical integration.
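For symmetric univariate α-stable kernels, such one-dimensional numerical integration amounts to Fourier inversion of the characteristic function e^{−(σt)^α}. A minimal sketch (grid sizes are illustrative), checked against the closed-form Cauchy case α = 1:

```python
import numpy as np

def sym_stable_pdf(z, alpha, sigma=1.0):
    """Symmetric alpha-stable pdf at z by 1-d Fourier inversion of exp(-(sigma t)^alpha)."""
    t, dt = np.linspace(0.0, 40.0, 400001, retstep=True)
    vals = np.cos(t * z) * np.exp(-(sigma * t) ** alpha)
    return (np.sum(vals) - 0.5 * vals[0]) * dt / np.pi  # trapezoid rule

z = 1.3
numeric = sym_stable_pdf(z, alpha=1.0)        # Cauchy case has a closed form
closed = 1.0 / (np.pi * (1.0 + z**2))
print(numeric, closed)

# The same routine evaluates, e.g., the Holtsmark case alpha = 3/2 numerically:
print(sym_stable_pdf(z, alpha=1.5))
```

For α without closed forms (e.g., the Holtsmark case α = 3/2), the same quadrature gives kernel values directly, at the cost of one one-dimensional integral per evaluation.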
The STABLE 5.1 software supports the computation of sub-Gaussian pdfs in dimension d < 100. GH kernels and their subclasses are also elliptical pdfs, and for their computation it likewise suffices to compute a one-dimensional amplitude function κ(r). VG kernels, or Matérn kernels, which generalize Laplace kernels, are used as covariance kernels in Gaussian processes. GH and NIG kernels are variants of Matérn kernels, all of which are expressed by the modified Bessel function of the third kind. For example, there is an R package, ghyp, for the GH distributions (Breymann and Lüthi, 2013).

10. John Nolan's page: http://academic2.american.edu/~jpnolan/stable/stable.html

In addition, random Fourier features (Rahimi and Recht, 2007) may offer an approach to approximately computing conjugate kernels. From Proposition 4.2, we have the equality ⟨m_P, m_Q⟩_{H_{A,ν_s}} = k_{A+A_P+A_Q, ν_s+ν̃_P+ν_Q}(γ_P, γ_Q) = E_ω[ζ_ω(γ_P) ζ_ω(γ_Q)*]. The RKHS inner product (l.h.s.) may be computed by approximating the expectation of ζ_ω(γ_P)ζ_ω(γ_Q)* (r.h.s.), sampling ω from the (normalized) characteristic function with the generating triplet (A + A_P + A_Q, ν_s + ν̃_P + ν_Q).

7. Conclusion

In this paper, we introduced the class of CID kernels, which constitutes a large subclass of the shift-invariant characteristic kernels on R^d; CID kernels are closed under convolution but not under addition and pointwise product. We introduced absorbing kernels, conjugate kernels, and the convolution trick with respect to parametric models, where the basic computation of kernel mean values and RKHS inner products reduces to the computation of the same p.d. kernels with different parameters, an extension of the kernel trick. Although the convolution trick offers a useful mathematical view, the computation of conjugate kernels is not tractable in general. We therefore either restrict the convolution trick to tractable cases or approximately compute intractable conjugate kernels.
Future work includes investigating the effectiveness of the convolution trick in practice and developing approximation algorithms to efficiently compute intractable conjugate kernels.

Acknowledgments

We thank the anonymous reviewers and the action editor for helpful comments. Y.N. thanks Prof. Tatsuhiko Saigo and Prof. Takaaki Shimura for helpful discussions on infinitely divisible distributions. This work was supported in part by JSPS KAKENHI (grant nos. 26870821 and 22300098), the MEXT Grant-in-Aid for Scientific Research on Innovative Areas (no. 25120012), and the Program to Disseminate Tenure Tracking System, MEXT, Japan.

Appendix A. α-Stable Distributions

We briefly review the α-stable distributions on R^d.

A.1 α-Stable Distributions on R^d

The α-stable distribution on R^d has the following characteristic function:

Theorem A.1 (Samorodnitsky and Taqqu, 1994, Theorem 2.3.1, p. 65)  Let α ∈ (0, 2). Then, X = (X_1, . . . , X_d) is an α-stable random vector in R^d if and only if there exist a finite measure Γ on the unit sphere S^{d−1} = {s ∈ R^d : ‖s‖ = 1} and a vector µ_0 ∈ R^d such that

P̂(θ) = exp(−∫_{S^{d−1}} |⟨θ, s⟩|^α (1 − i sgn(⟨θ, s⟩) tan(πα/2)) Γ(ds) + i⟨θ, µ_0⟩), (α ≠ 1),
P̂(θ) = exp(−∫_{S^{d−1}} |⟨θ, s⟩| (1 + i(2/π) sgn(⟨θ, s⟩) ln|⟨θ, s⟩|) Γ(ds) + i⟨θ, µ_0⟩), (α = 1).

The pair (Γ, µ_0) is unique. The measure Γ is called the spectral measure. See Samorodnitsky and Taqqu (1994, Section 2.3) for examples of spectral measures. The radial sub-Gaussian distribution has a uniform spectral measure. An α-stable random vector X = (X_1, . . . , X_d) has independent components if and only if its spectral measure Γ is discrete and concentrated on the intersection of the axes with the sphere S^{d−1}. It is known that any nondegenerate stable distribution on R^d has a C^∞ pdf (Sato, 1999, Example 28.2, p. 190).
An α-stable distribution on R^d is symmetric if and only if µ_0 = 0 and Γ is a symmetric measure on S^{d−1} (i.e., Γ(A) = Γ(−A) for any A ∈ B(S^{d−1})) (Samorodnitsky and Taqqu, 1994, p. 73). For each α ∈ (0, 2), α-stable distributions on R^d have the generating triplet (0, ν, γ) with

ν(B) = ∫_{S^{d−1}} Γ(ds) ∫_0^∞ 1_B(rs) dr/r^{1+α}, B ∈ B(R^d), (5)

where Γ is the spectral measure on S^{d−1} (Sato, 1999, Theorem 14.3, p. 77). The sum of Lévy measures ν_1 + ν_2 corresponds to the sum of spectral measures Γ_1 + Γ_2.

A.2 α-Stable Distributions on R

As a special case, an α-stable distribution on R has the following characteristic function:

Theorem A.2 (Samorodnitsky and Taqqu, 1994, Definition 1.1.6, p. 5)  A random variable X is α-stable (α ∈ (0, 2]) in R if and only if there exist parameters σ ≥ 0, β ∈ [−1, 1], and µ ∈ R such that its characteristic function has the form

P̂(θ) = exp(−σ^α|θ|^α(1 − iβ sgn(θ) tan(πα/2)) + iµθ), (α ≠ 1),
P̂(θ) = exp(−σ|θ|(1 + iβ(2/π) sgn(θ) ln|θ|) + iµθ), (α = 1),

where sgn(θ) is the sign function taking the values 1 (θ > 0), 0 (θ = 0), and −1 (θ < 0). When α ∈ (0, 2), the parameters σ, β, and µ are unique. When α = 2, β is irrelevant, and σ and µ are unique. An α-stable distribution on R is specified by the parameters (σ, β, µ), where σ is a scale parameter, β is a skewness parameter, and µ is a location parameter. σ = 0 corresponds to a delta measure. For α ∈ (0, 2), an α-stable distribution is symmetric if and only if β = µ = 0 (Samorodnitsky and Taqqu, 1994, Property 1.2.5, p. 11). A 2-stable distribution is symmetric if and only if µ = 0. An α-stable density does not generally have a closed-form expression, except for some special cases; however, every nondegenerate stable distribution has a C^∞ pdf (Sato, 1999, Example 28.2, p. 190). Some known univariate α-stable pdfs, expressed by elementary functions and special functions, are given in Appendix A.3. The Lévy measure ν of a univariate stable distribution is obtained by letting d = 1 in the Lévy measure (5).
If d = 1, then S^0 = {−1, 1} and Γ = Γ({−1})δ_{−1} + Γ({1})δ_1, where Γ({−1}), Γ({1}) ≥ 0 and Γ({−1}) + Γ({1}) > 0 (Samorodnitsky and Taqqu, 1994, Example 2.3.3, p. 67). By substituting this into equation (5), we obtain the Lévy measure ν of a univariate stable distribution as

ν(dx) = Γ({1}) x^{−(1+α)} 1_{(0,∞)}(x) dx + Γ({−1}) |x|^{−(1+α)} 1_{(−∞,0)}(x) dx.

A stable distribution S_α(σ, β, µ) is given in terms of the spectral measure by

σ = (Γ({1}) + Γ({−1}))^{1/α} > 0,  β = (Γ({1}) − Γ({−1}))/(Γ({1}) + Γ({−1})) ∈ [−1, 1].

The sum of Lévy measures ν_1 + ν_2 corresponds to the sums of masses Γ_1({−1}) + Γ_2({−1}) and Γ_1({1}) + Γ_2({1}). From the viewpoint of the spectral measure, we can thus see the convolution property S_α(σ_1, β_1, µ_1) ∗ S_α(σ_2, β_2, µ_2) = S_α((σ_1^α + σ_2^α)^{1/α}, (σ_1^α β_1 + σ_2^α β_2)/(σ_1^α + σ_2^α), µ_1 + µ_2) of the univariate stable distribution.

A.3 Closed-Form and Special Function Forms of α-Stable PDFs on R

There are three cases where the α-stable pdf on R is expressed by elementary functions:

1. The 2-stable distribution S_2(σ, β, µ) is the Gaussian N(µ, 2σ²), where β has no effect, with the pdf f_Gauss(x) = (1/(2σ√π)) e^{−(x−µ)²/(4σ²)}, x ∈ R.

2. The 1-stable distribution S_1(σ, β = 0, µ) is the Cauchy distribution with the pdf f_Cauchy(x) = σ/(π((x − µ)² + σ²)), x ∈ R.

3. The 1/2-stable distribution S_{1/2}(σ, β = 1, µ) is the Lévy distribution with the pdf f_Levy(x) = √(σ/(2π)) (x − µ)^{−3/2} e^{−σ/(2(x−µ))}, µ < x < ∞.

There are also cases where the α-stable pdf is expressed by special functions; the following expressions are found in Lee (2010). Note that kernel means m_P and RKHS inner products also take these forms. For simplicity, we only show standardized stable pdfs dstable(x; α, σ = 1, β, µ = 0).

Fresnel integrals: If (α, σ, β, µ) = (1/2, 1, 0, 0), the pdf dstable(x; 1/2, 1, 0, 0) is expressed in terms of the Fresnel integrals

C(z) = ∫_0^z cos(πt²/2) dt,  S(z) = ∫_0^z sin(πt²/2) dt.

This is a symmetric stable pdf, and k(x, y) = dstable(x − y; 1/2, 1, 0, 0), x, y ∈ R, gives a characteristic p.d. kernel.
Modified Bessel function: If (α, σ, β, µ) = (1/3, 1, 1, 0), the one-sided continuous density dstable(x; 1/3, 1, 1, 0) is a constant multiple of x^{−3/2} K_{1/3}(c x^{−1/2}) for a constant c (the exact coefficients, involving the powers 3^{7/4} and 3^{9/4}, are given in Lee, 2010), where K_ν(x) is a modified Bessel function of the third kind.

Hypergeometric function: If (α, σ, β, µ) = (4/3, 1, 0, 0), then dstable(x; 4/3, 1, 0, 0) is expressed as a combination of two generalized hypergeometric functions ₂F₂, with coefficients formed from powers of 3 and Γ-function values such as Γ(7/12), Γ(11/12), Γ(13/12), and Γ(17/12) (see Lee, 2010, for the exact expression). Here, pFq is the (generalized) hypergeometric function

pFq(a_1, . . . , a_p; b_1, . . . , b_q; z) = ∑_{n=0}^∞ ((a_1)_n · · · (a_p)_n)/((b_1)_n · · · (b_q)_n) · z^n/n!

with the Pochhammer symbol (a)_0 = 1, (a)_n = a(a + 1) · · · (a + n − 1) for n ∈ N_+. This is a symmetric stable pdf, and k(x, y) = dstable(x − y; 4/3, 1, 0, 0), x, y ∈ R, gives a characteristic p.d. kernel.

If (α, σ, β, µ) = (3/2, 1, 0, 0) (the Holtsmark distribution), then dstable(x; 3/2, 1, 0, 0) is similarly expressed by a combination of ₂F₃ functions with coefficients involving Γ(5/3) and Γ(4/3) (Lee, 2010). This is a symmetric stable pdf, and the Holtsmark kernel k(x, y) = dstable(x − y; 3/2, 1, 0, 0), x, y ∈ R, gives a characteristic p.d. kernel.

Whittaker function: If (α, σ, β, µ) = (2/3, 1, 0, 0),

dstable(x; 2/3, 1, 0, 0) = (1/(2√(3π)|x|)) exp(2/(27x²)) W_{−1/2,1/6}(4/(27x²)),

where W_{λ,µ}(z) is the Whittaker function defined as

W_{λ,µ}(z) = (z^λ e^{−z/2}/Γ(µ − λ + 1/2)) ∫_0^∞ e^{−t} t^{µ−λ−1/2} (1 + t/z)^{µ+λ−1/2} dt, Re(µ − λ) > −1/2, |arg(z)| < π.

This is a symmetric stable pdf, and k(x, y) = dstable(x − y; 2/3, 1, 0, 0), x, y ∈ R, gives a characteristic p.d. kernel.
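The generalized hypergeometric series above is straightforward to evaluate by truncation; a minimal sketch (the sanity checks use the standard identities ₀F₀(;;z) = e^z and ₁F₁(1; 2; z) = (e^z − 1)/z, not the stable-pdf coefficients themselves):

```python
from math import factorial, exp

def poch(a, n):
    """Pochhammer symbol (a)_n = a (a+1) ... (a+n-1), with (a)_0 = 1."""
    r = 1.0
    for k in range(n):
        r *= a + k
    return r

def pFq(a_list, b_list, z, terms=80):
    """Truncated generalized hypergeometric series pFq(a_1..a_p; b_1..b_q; z)."""
    s = 0.0
    for n in range(terms):
        num = 1.0
        for a in a_list:
            num *= poch(a, n)
        den = 1.0
        for b in b_list:
            den *= poch(b, n)
        s += (num / den) * z**n / factorial(n)
    return s

# Sanity checks of the series definition:
print(pFq([], [], 1.0), exp(1.0))            # 0F0(;;1) = e
print(pFq([1.0], [2.0], 1.0), exp(1.0) - 1)  # 1F1(1; 2; 1) = e - 1
```

With the coefficients from Lee (2010) plugged in, the same series evaluates the ₂F₂ and ₂F₃ expressions of the 4/3-stable and Holtsmark pdfs.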
If (α, σ, β, µ) = (2/3, 1, 1, 0), the one-sided density dstable(x; 2/3, 1, 1, 0) is expressed by √(3/π) |x|^{−1} times an exponential factor involving 16/(27x²) and a Whittaker-function factor (see Lee, 2010, for the exact expression).

If (α, σ, β, µ) = (3/2, 1, 1, 0), the α-stable density dstable(x; 3/2, 1, 1, 0) is given piecewise by |x|^{−1} exp(x³/27) times the Whittaker function W_{1/2,1/6}(2|x|³/27) for x < 0 and W_{−1/2,1/6}(2|x|³/27) for x > 0, up to constants involving √(3π) (see Lee, 2010, for the exact branches).

Lommel function: If (α, σ, β, µ) = (1/3, 1, 0, 0), then dstable(x; 1/3, 1, 0, 0) is the real part of a complex multiple of the Lommel function S_{0,1/3}, with prefactor 2 exp(−iπ/4)/(3π|x|^{3/2}) and a complex argument proportional to exp(iπ/4) (see Lee, 2010). Here, the Lommel functions s_{µ,ν}(z) and S_{µ,ν}(z) are defined by

s_{µ,ν}(z) = (π/2)[ Y_ν(z) ∫_0^z t^µ J_ν(t) dt − J_ν(z) ∫_0^z t^µ Y_ν(t) dt ],
S_{µ,ν}(z) = s_{µ,ν}(z) + 2^{µ−1} Γ((1 + µ + ν)/2) Γ((1 + µ − ν)/2) [ sin((µ − ν)π/2) J_ν(z) − cos((µ − ν)π/2) Y_ν(z) ],

where J_ν(z) and Y_ν(z) are Bessel functions of the first and second kind, respectively. This is a symmetric stable pdf, and k(x, y) = dstable(x − y; 1/3, 1, 0, 0), x, y ∈ R, gives a characteristic p.d. kernel.

Landau distribution: If (α, σ, β, µ) = (1, 1, 1, 0) (the Landau distribution),

dstable(x; 1, 1, 1, 0) = (1/π) ∫_0^∞ e^{−t log t − xt} sin(πt) dt.

A.4 Sub-Gaussian (Elliptically Contoured) α-Stable Distributions on R^d

The sub-Gaussian α-stable distribution has the following characteristic function:

Proposition A.3 (Samorodnitsky and Taqqu, 1994, Proposition 2.5.2, p. 78)  Let α ∈ (0, 2). A sub-Gaussian α-stable random vector X in R^d has the characteristic function

P̂(θ) = exp(−((1/2) ∑_{i,j=1}^d θ_i θ_j R_ij)^{α/2} + i⟨θ, µ_0⟩),

where R is a p.d. matrix and µ_0 ∈ R^d is a shift vector. α = 2 and α = 1 give the multivariate Gaussian and Cauchy distributions, respectively. For α ∈ (0, 2), the radial sub-Gaussian SG_α(R^d)[I] (with identity matrix R = I) has the uniform spectral measure Γ(B) = c|B|, B ∈ B(S^{d−1}), in the Lévy measure (5) (Samorodnitsky and Taqqu, 1994, Proposition 2.5.5, p. 79). The sub-Gaussian SG_α(R^d)[R] with a p.d. matrix R is the elliptical version of the radial sub-Gaussians; its spectral measure is given in Samorodnitsky and Taqqu (1994, Proposition 2.5.8, p. 82).

Appendix B.
GH Classes on R^d

A GH distribution on R^d is given by the normal mean-variance mixture with the GIG mixing distribution. See, e.g., v. Hammerstein (2010) for more information; we reproduce some of the facts here.

B.1 GIG Distributions on R_+

A generalized inverse Gaussian (GIG) distribution GIG(λ, δ, γ) on R_+ is given by the pdf

d_{GIG(λ,δ,γ)}(x) = ((γ/δ)^λ/(2K_λ(δγ))) x^{λ−1} exp(−(1/2)(δ²/x + γ²x)) 1_{(0,∞)}(x),

where K_λ(x) is the modified Bessel function of the third kind with index λ. The parameters (λ, δ, γ) take the following values: δ ≥ 0, γ > 0 if λ > 0; δ > 0, γ > 0 if λ = 0; δ > 0, γ ≥ 0 if λ < 0, where δ = 0 and γ = 0 correspond to limiting cases,¹¹ which are the Gamma distribution and the inverse Gamma distribution, respectively. The GIG distributions have the following convolution properties:

Proposition B.1 (v. Hammerstein, 2010, Proposition 1.11, p. 11)  Within the class of GIG distributions, the following convolution properties hold:

a) GIG(−1/2, δ_1, γ) ∗ GIG(−1/2, δ_2, γ) = GIG(−1/2, δ_1 + δ_2, γ),
b) GIG(−1/2, δ_1, γ) ∗ GIG(1/2, δ_2, γ) = GIG(1/2, δ_1 + δ_2, γ),
c) GIG(−λ, δ, γ) ∗ GIG(λ, 0, γ) = GIG(λ, δ, γ), λ > 0,
d) GIG(λ_1, 0, γ) ∗ GIG(λ_2, 0, γ) = GIG(λ_1 + λ_2, 0, γ), λ_1, λ_2 > 0.

11. If λ ≠ 0, then K_λ(x) ~ (1/2)Γ(|λ|)(x/2)^{−|λ|} (x → 0).

B.2 GH Distributions on R^d

A GH distribution has the following pdf:

d_{GH_d(λ,α,β,δ,µ,Σ)}(x) = a(λ, α, β, δ, µ, Σ) (√(δ² + ‖x−µ‖²_{Σ^{−1}}))^{λ−d/2} K_{λ−d/2}(α√(δ² + ‖x−µ‖²_{Σ^{−1}})) e^{⟨β, x−µ⟩},

where a(λ, α, β, δ, µ, Σ) is the normalization constant

a(λ, α, β, δ, µ, Σ) = (α² − ‖β‖²_Σ)^{λ/2} / ((2π)^{d/2} |Σ|^{1/2} α^{λ−d/2} δ^λ K_λ(δ√(α² − ‖β‖²_Σ))).

The GH parameters (λ, α, β, δ, µ, Σ) take the following values: λ ∈ R, α, δ ∈ R_+, β, µ ∈ R^d, Σ ∈ P_d, with δ ≥ 0, 0 ≤ ‖β‖_Σ < α if λ > 0; δ > 0, 0 ≤ ‖β‖_Σ < α if λ = 0; δ > 0, 0 ≤ ‖β‖_Σ ≤ α if λ < 0, where δ = 0 or α = ‖β‖_Σ is a limiting case. The GH distribution is symmetric if and only if β = 0 and µ = 0. The symmetric GH has the elliptical pdf

d_{SGH_d(λ,α,δ,Σ)}(x) = (α^{d/2}/((2π)^{d/2}|Σ|^{1/2} δ^λ K_λ(δα))) (√(δ² + ‖x‖²_{Σ^{−1}}))^{λ−d/2} K_{λ−d/2}(α√(δ² + ‖x‖²_{Σ^{−1}})),

where ν(t) in equation (2) is given by a GIG distribution.
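All of these pdfs are built from the modified Bessel function K_λ, which can be evaluated from its standard integral representation K_λ(z) = ∫_0^∞ e^{−z cosh t} cosh(λt) dt. As a sanity check, for λ = −1/2 (the NIG case) this matches the closed form √(π/(2z)) e^{−z}; grid sizes below are illustrative:

```python
import numpy as np

def bessel_k(lam, z, t_max=12.0, n=120001):
    """K_lam(z) via the integral representation int_0^inf e^{-z cosh t} cosh(lam t) dt."""
    t, dt = np.linspace(0.0, t_max, n, retstep=True)
    return np.sum(np.exp(-z * np.cosh(t)) * np.cosh(lam * t)) * dt

z = 2.0
numeric = bessel_k(-0.5, z)
closed = np.sqrt(np.pi / (2.0 * z)) * np.exp(-z)  # known closed form for lam = +-1/2
print(numeric, closed)
```

With bessel_k in hand, the GIG, GH, NIG, and VG pdfs of this appendix can be evaluated pointwise without any special-function library.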
B.3 NIG Distributions on R^d

The NIG distribution NIG_d(α, β, δ, µ, Σ) has the following pdf (v. Hammerstein, 2010, p. 74):

d_{NIG_d(α,β,δ,µ,Σ)}(x) ∝ (√(δ² + ‖x−µ‖²_{Σ^{−1}}))^{−(d+1)/2} K_{(d+1)/2}(α√(δ² + ‖x−µ‖²_{Σ^{−1}})) e^{⟨β, x−µ⟩}.

B.4 VG Distributions on R^d

The VG distribution VG_d(λ, α, β, µ, Σ) has the following pdf (v. Hammerstein, 2010, p. 74):¹²

d_{VG_d(λ,α,β,µ,Σ)}(x) ∝ (‖x−µ‖_{Σ^{−1}})^{λ−d/2} K_{λ−d/2}(α‖x−µ‖_{Σ^{−1}}) e^{⟨β, x−µ⟩}.

12. The VG pdf is bounded at x = µ if and only if λ > d/2.

References

D. Applebaum. Lévy Processes and Stochastic Calculus. Second edition, Cambridge University Press, 2009.

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.

O. E. Barndorff-Nielsen. Processes of normal inverse Gaussian type. Finance and Stochastics, 2:41-68, 1998.

O. E. Barndorff-Nielsen and K. Prause. Apparent scaling. Finance and Stochastics, 5:103-113, 2001.

O. E. Barndorff-Nielsen and C. Halgreen. Infinite divisibility of the hyperbolic and generalized inverse Gaussian distributions. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 38:309-312, 1977.

O. E. Barndorff-Nielsen and C. Halgreen. The variance gamma (v.g.) model for share market returns. Journal of Business, 63:511-524, 1990.

C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, 1984.

A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.

M. L. Bianchi, S. T. Rachev, Y. S. Kim, and F. J. Fabozzi. Tempered infinitely divisible distributions and processes. Theory of Probability and Its Applications (TVP), SIAM, 55(1):59-86, 2010.

S. Bochner. Lectures on Fourier Integrals. With an author's supplement on monotonic functions, Stieltjes integrals, and harmonic analysis. Princeton University Press, Princeton, NJ, 1959.

B. Boots, G. Gordon, and A. Gretton.
Hilbert space embeddings of predictive state representations. In Uncertainty in Artificial Intelligence (UAI), 2013.

W. Breymann and D. Lüthi. ghyp: A package on generalized hyperbolic distributions. 2013.

P. Carr, H. Geman, D. B. Madan, and M. Yor. The fine structure of asset returns: an empirical investigation. Journal of Business, 75:305-332, 2002.

Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Uncertainty in Artificial Intelligence (UAI), 2010.

R. Cont and P. Tankov. Financial Modelling with Jump Processes. Boca Raton: Chapman & Hall/CRC Press, 2004.

F. W. Steutel and K. van Harn. Infinite Divisibility of Probability Distributions on the Real Line. Monogr. Textb. Pure Appl. Math., vol. 259, Marcel Dekker, 2004.

K. Fukumizu and C. Leng. Gradient-based kernel method for feature extraction and variable selection. In Annual Conference on Neural Information Processing Systems (NIPS), pages 2123-2131, 2012.

K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73-99, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Annual Conference on Neural Information Processing Systems (NIPS), pages 489-496, 2008.

K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, pages 3753-3783, 2013.

A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. In Annual Conference on Neural Information Processing Systems (NIPS), 2008.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13:723-773, 2012.

E. Grosswald. The Student t-distribution of any degree of freedom is infinitely divisible. Zeit. Wahrsch. Verw. Gebiete, 36:103-109, 1976.

S.
Grünewälder, G. Lever, L. Baldassarre, M. Pontil, and A. Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. In International Conference on Machine Learning (ICML), pages 535–542, 2012.

M. Kanagawa, Y. Nishiyama, A. Gretton, and K. Fukumizu. Filtering with State Observation Examples via Kernel Monte Carlo Filter. Neural Computation, 28:382–444, 2016.

W. H. Lee. Continuous and discrete properties of stochastic processes. PhD thesis, The University of Nottingham, 2010.

J. R. Lloyd and Z. Ghahramani. Statistical Model Criticism using Kernel Two Sample Test. In Annual Conference on Neural Information Processing Systems (NIPS), 2015.

D. B. Madan, P. Carr, and E. C. Chang. The variance gamma process and option pricing. European Finance Review, 2:79–105, 1998.

L. McCalman, S. O'Callaghan, and F. Ramos. Multi-modal estimation with kernel embeddings for learning motion models. In IEEE International Conference on Robotics and Automation (ICRA), 2013.

S. Mika, B. Schölkopf, A. Smola, K. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In Annual Conference on Neural Information Processing Systems (NIPS), pages 536–542, 1999.

K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from Distributions via Support Measure Machines. In Annual Conference on Neural Information Processing Systems (NIPS), pages 10–18, 2012.

Y. Nishiyama, A. Boularias, A. Gretton, and K. Fukumizu. Hilbert Space Embeddings of POMDPs. In Uncertainty in Artificial Intelligence (UAI), pages 644–653, 2012.

Y. Nishiyama, M. Kanagawa, A. Gretton, and K. Fukumizu. Model-based Kernel Sum Rule. arXiv:1409.5178, 2014.

Y. Nishiyama, A. H. Afsharinejad, S. Naruse, B. Boots, and L. Song. The Nonparametric Kernel Bayes Smoother. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

J. Nolan. Bibliography on stable distributions, processes and related topics.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.295.9970&rep=rep1&type=pdf, 2013a.

J. Nolan. Multivariate elliptically contoured stable distributions: theory and estimation. Computational Statistics, 28(5):2067–2089, 2013b.

K. Prause. The generalized hyperbolic model: estimation, financial derivatives, and risk measures. PhD thesis, University of Freiburg, 1999.

S. T. Rachev, Y. S. Kim, M. L. Bianchi, and F. J. Fabozzi. Financial Models with Lévy Processes and Volatility Clustering. Wiley & Sons, 2011.

A. Rahimi and B. Recht. Random Features for Large-Scale Kernel Machines. In Annual Conference on Neural Information Processing Systems (NIPS), 2007.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

K. Rawlik, M. Toussaint, and S. Vijayakumar. Path Integral Control by Reproducing Kernel Hilbert Space Embedding. In International Joint Conference on Artificial Intelligence (IJCAI), 2013.

J. Rosiński. Tempering stable processes. Stochastic Processes and Their Applications, 117(6):677–707, 2007.

G. Samorodnitsky and M. S. Taqqu. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Chapman & Hall, 1994.

K. Sato. Lévy Processes and Infinitely Divisible Distributions. Cambridge University Press, 1999.

B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, 2002.

W. Schoutens. Lévy Processes in Finance: Pricing Financial Derivatives. Wiley, Chichester, 2003.

A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory (ALT), pages 13–31, 2007.

L. Song, X. Zhang, A. Smola, A. Gretton, and B. Schölkopf. Tailoring Density Estimation via Reproducing Kernel Moment Matching. In International Conference on Machine Learning (ICML), pages 992–999, 2008.

L. Song, J. Huang, A. Smola, and K. Fukumizu.
Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems. In International Conference on Machine Learning (ICML), pages 961–968, 2009.

L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. Hilbert Space Embeddings of Hidden Markov Models. In International Conference on Machine Learning (ICML), pages 991–998, 2010.

L. Song, A. Gretton, D. Bickson, Y. Low, and C. Guestrin. Kernel Belief Propagation. Journal of Machine Learning Research - Proceedings Track, 15:707–715, 2011.

L. Song, K. Fukumizu, and A. Gretton. Kernel embedding of conditional distributions. IEEE Signal Processing Magazine, 30(4):98–111, 2013.

B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert Space Embeddings and Metrics on Probability Measures. Journal of Machine Learning Research, 11:1517–1561, 2010.

B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. Universality, Characteristic Kernels and RKHS Embedding of Measures. Journal of Machine Learning Research, 12:2389–2410, 2011.

I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.

O. Thorin. An extension of the notion of a generalized Γ-convolution. Scandinavian Actuarial Journal, pages 141–149, 1978.

E. A. F. v. Hammerstein. Generalized hyperbolic distributions: Theory and applications to CDO pricing. PhD thesis, University of Freiburg, 2010.

H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2005.

V. M. Zolotarev. One-Dimensional Stable Distributions. Translations of Mathematical Monographs, American Mathematical Society, 1986.
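Addendum to Appendix B. For readers who wish to plot or sanity-check the NIG and VG densities of Appendix B, the following is a minimal numerical sketch (not part of the paper): it evaluates the densities on R up to their normalizing constants, in the special case d = 1 and Δ = I, approximating the modified Bessel function K_ν by its integral representation K_ν(z) = ∫_0^∞ e^{-z cosh t} cosh(νt) dt with a simple trapezoidal rule. All function names and parameter defaults are illustrative.

```python
import math

def bessel_k(nu, z, n=2000, t_max=20.0):
    """Modified Bessel function of the third kind K_nu(z), z > 0, via the
    integral representation K_nu(z) = int_0^inf exp(-z cosh t) cosh(nu t) dt,
    truncated at t_max and approximated with the trapezoidal rule."""
    h = t_max / n
    # Endpoint terms: cosh(0) = 1 at t = 0; the t_max term underflows to ~0.
    total = 0.5 * (math.exp(-z)
                   + math.exp(-z * math.cosh(t_max)) * math.cosh(nu * t_max))
    for i in range(1, n):
        t = i * h
        total += math.exp(-z * math.cosh(t)) * math.cosh(nu * t)
    return h * total

def nig_unnormalized(x, mu, alpha, beta, delta):
    """Unnormalized NIG_d density for d = 1, Delta = I:
    (delta^2 + (x-mu)^2)^(-(d+1)/4) * K_{(d+1)/2}(alpha sqrt(delta^2 + (x-mu)^2))
    * exp(beta (x - mu))."""
    d = 1
    q = delta ** 2 + (x - mu) ** 2
    return (q ** (-(d + 1) / 4.0)
            * bessel_k((d + 1) / 2.0, alpha * math.sqrt(q))
            * math.exp(beta * (x - mu)))

def vg_unnormalized(x, mu, lam, alpha, beta):
    """Unnormalized VG_d density for d = 1, Delta = I:
    |x-mu|^(lam - d/2) * K_{lam - d/2}(alpha |x-mu|) * exp(beta (x - mu)),
    valid for x != mu (bounded at x = mu iff lam > d/2)."""
    d = 1
    r = abs(x - mu)
    return (r ** (lam - d / 2.0)
            * bessel_k(lam - d / 2.0, alpha * r)
            * math.exp(beta * (x - mu)))
```

As a quick check, the closed form K_{1/2}(z) = sqrt(pi/(2z)) e^{-z} can be used to validate `bessel_k`, and for beta = 0 both densities should be symmetric about mu with their mode at mu.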