Sequential Kernelized Independence Testing

Aleksandr Podkopaev 1, Patrick Blöbaum 2, Shiva Prasad Kasiviswanathan 2, Aaditya Ramdas 1 2

Independence testing is a classical statistical problem that has been extensively studied in the batch setting, where the sample size is fixed before data collection. However, practitioners often prefer procedures that adapt to the complexity of the problem at hand instead of setting the sample size in advance. Ideally, such procedures should (a) stop earlier on easy tasks (and later on harder tasks), hence making better use of available resources, and (b) continuously monitor the data and efficiently incorporate statistical evidence after collecting new data, while controlling the false alarm rate. Classical batch tests are not tailored for streaming data: valid inference after data peeking requires correcting for multiple testing, which results in low power. Following the principle of testing by betting, we design sequential kernelized independence tests that overcome these shortcomings. We exemplify our broad framework using bets inspired by kernelized dependence measures, e.g., the Hilbert-Schmidt independence criterion. Our test is also valid under non-i.i.d. time-varying settings. We demonstrate the power of our approaches on both simulated and real data.

1 Statistics & Data Science and Machine Learning Departments, Carnegie Mellon University. 2 Amazon Research. Correspondence to: Aleksandr Podkopaev. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

1. Introduction

Independence testing is a fundamental statistical problem that has also been studied within information theory and machine learning. Given paired observations $(X, Y)$ sampled from some (unknown) joint distribution $P_{XY}$, the goal is to test the null hypothesis that $X$ and $Y$ are independent. The literature on independence testing is vast: there is no unique way to measure dependence, and different measures give rise to different tests. Traditional measures of dependence, such as Pearson's $r$, Spearman's $\rho$, and Kendall's $\tau$, are limited to the case of univariate random variables. Kernel tests (Jordan & Bach, 2001; Gretton et al., 2005c;a) are amongst the most prominent modern tools for nonparametric independence testing and work for general $\mathcal{X}$, $\mathcal{Y}$ spaces.

In the literature, heavy emphasis has been placed on batch testing, where one has access to a sample whose size is specified in advance. However, even if the random variables are dependent, the sample size that suffices to detect dependence is never known a priori. If the results of a test are promising yet inconclusive (e.g., a p-value is slightly larger than the chosen significance level), one may want to collect more data and re-conduct the study. This is not allowed by traditional batch tests. We focus on sequential tests that allow peeking at the observed data to decide whether to stop and reject the null or to continue collecting data.

Problem Setup. Suppose that one observes a stream of data $(Z_t)_{t \ge 1}$, where $Z_t = (X_t, Y_t) \overset{\mathrm{iid}}{\sim} P_{XY}$. We design sequential tests for the following pair of hypotheses:
$$H_0: Z_t \overset{\mathrm{iid}}{\sim} P_{XY},\ t \ge 1, \ \text{ and } \ P_{XY} = P_X \times P_Y, \quad (1a)$$
$$H_1: Z_t \overset{\mathrm{iid}}{\sim} P_{XY},\ t \ge 1, \ \text{ and } \ P_{XY} \neq P_X \times P_Y. \quad (1b)$$
Following the framework of tests of power one (Darling & Robbins, 1968), we define a level-$\alpha$ sequential test as a mapping $\Phi: \bigcup_{t=1}^{\infty} (\mathcal{X} \times \mathcal{Y})^t \to \{0, 1\}$ that satisfies
$$P_{H_0}\left(\exists t \ge 1: \Phi(Z_1, \dots, Z_t) = 1\right) \le \alpha.$$
As is standard, 0 stands for "do not reject the null yet" and 1 stands for "reject the null and stop". Defining the stopping time $\tau := \inf\{t \ge 1: \Phi(Z_1, \dots, Z_t) = 1\}$ as the first time that the test outputs 1, a sequential test must satisfy $P_{H_0}(\tau < \infty) \le \alpha$. We work in a very general composite nonparametric setting: $H_0$ and $H_1$ consist of huge classes of distributions (discrete/continuous) for which there may not be a common reference measure, making it impossible to define densities and thus ruling out likelihood-ratio-based methods.

Our Contributions. Following the principle of testing by betting, we design consistent sequential nonparametric independence tests. Our bets are inspired by popular kernelized dependence measures: the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005a), the constrained covariance criterion (COCO) (Gretton et al., 2005c), and kernelized canonical correlation (KCC) (Jordan & Bach, 2001). We provide theoretical guarantees on time-uniform type I error control for these tests (the type I error is controlled even if the test is continuously monitored and adaptively stopped), and we further establish consistency and asymptotic rates for our sequential HSIC under the i.i.d. setting. Our tests also remain valid even if the underlying distribution changes over time. Additionally, while the initial construction of our tests requires bounded kernels, we also develop variants based on symmetry-based betting that overcome this requirement. This strategy can be readily used with a linear kernel to construct a sequential linear correlation test. We justify the practicality of our tests through a detailed empirical study. We start by highlighting two major shortcomings of existing tests that our new tests overcome.

(i) Limitations of Corrected Batch Tests and Reduction to Two-sample Testing. Batch tests (without corrections for multiple testing) have an inflated false alarm rate under continuous monitoring (see Appendix A.1). Naive Bonferroni corrections restore type I error control, but generally result in tests with low power. This motivates a direct design of sequential tests (not by correcting batch tests). It is tempting to reduce sequential independence testing to sequential two-sample testing, for which a powerful solution has been recently designed (Shekhar & Ramdas, 2021). This can be achieved by splitting a single data stream into two and permuting the X data in one of the streams (see Appendix A.2). Still, the splitting results in inefficient use of data and thus low power, compared to our new direct approach (Figure 1).

Figure 1. Valid sequential independence tests for $Y_t = X_t \beta + \varepsilon_t$, where $X_t, \varepsilon_t \sim N(0, 1)$. "Batch + n-step" is batch HSIC with Bonferroni correction applied every n steps (allowing monitoring only at those steps). "Seq-MMD" refers to the reduction to two-sample testing (Appendix A.2). Our test outperforms the other tests.

(ii) Time-varying Independence Testing: Beyond the i.i.d. Setting. A common practice of using a permutation p-value for batch independence testing requires the (X, Y)-pairs to be i.i.d. (more generally, exchangeable). If the data distribution drifts, the resulting test is no longer valid, and even under mild changes, an inflated false alarm rate is observed empirically. Our tests handle more general non-stationary settings.
For a stream of independent data $(Z_t)_{t \ge 1}$, where $Z_t \sim P^{(t)}_{XY}$, consider the following pair of hypotheses:
$$H_0: P^{(t)}_{XY} = P^{(t)}_X \times P^{(t)}_Y, \ \forall t, \quad (2a)$$
$$H_1: \exists t': P^{(t')}_{XY} \neq P^{(t')}_X \times P^{(t')}_Y. \quad (2b)$$
Suppose that under $H_0$ in (2a), it holds that either $P^{(t-1)}_X = P^{(t)}_X$ or $P^{(t-1)}_Y = P^{(t)}_Y$ for each $t \ge 1$, meaning that either the distribution of X may change or that of Y may change, but not both simultaneously. In this case, our tests control the type I error, whereas batch independence tests fail to.

Example 1. Let $((W_t, V_t))_{t \ge 1}$ be a sequence of i.i.d. jointly Gaussian random variables with zero mean and covariance matrix with ones on the diagonal and $\rho$ off the diagonal. For $t = 1, 2, \dots$ and $i \in \{0, 1\}$, consider the following stream:
$$X_{2t-i} = 2c \sin(t) + W_{2t-i}, \qquad Y_{2t-i} = 3c \sin(t) + V_{2t-i}. \quad (3)$$
Setting $\rho = 0$ falls into the null case (2a), whereas any $\rho \neq 0$ implies dependence as per (2b). Visually, it is hard to distinguish between $H_0$ and $H_1$: the drift makes the data seem dependent (see Appendix E.1). In Figure 2(a), we show that our test controls the type I error, whereas the batch test fails to.¹

¹This is also related to Yule's nonsense correlation (Yule, 1926; Ernst et al., 2017), which would not pose a problem for our method.

Figure 2. Under distribution drift (3), SKIT controls the type I error under $H_0$ and has high power under $H_1$. Batch HSIC fails to control the type I error under $H_0$ (hence we do not plot its power). (a) The null is true, meaning that $W \perp V$ in Example 1; (batch) HSIC: dashed lines, SKIT: solid lines. (b) The alternative is true with $\rho = 1/2$.

Related Work. In addition to the aforementioned papers on batch independence testing, our work is also related to methods for "safe, anytime-valid inference", e.g., confidence sequences (Waudby-Smith & Ramdas, 2023, and references therein) and e-processes (Grünwald et al., 2020; Ramdas et al., 2022). The sequential nonparametric two-sample tests of Balsubramani & Ramdas (2016), based on linear-time test statistics and an empirical Bernstein inequality for random walks, are amongst the first results in this area. While such tests are valid in the same sense as ours, betting-based tests are much better empirically (Shekhar & Ramdas, 2021). The roots of the principle of testing by betting can be traced back to Ville's 1939 doctoral thesis (Ville, 1939); the principle was recently popularized by Shafer (2021). The latter work considered it mainly in the context of parametric and simple hypotheses, far from our setting. The works most closely related to the current paper are (Shekhar & Ramdas, 2021; Shaer et al., 2023; Grünwald et al., 2023), which also handle composite and nonparametric hypotheses. Shekhar & Ramdas (2021) use testing by betting to design sequential nonparametric two-sample tests, including a state-of-the-art sequential kernel maximum mean discrepancy test. Two recent works by Shaer et al. (2023) and Grünwald et al. (2023), developed in parallel to the current paper, extend these ideas to the setting of sequential conditional independence tests ($H_0: X \perp Y \mid Z$) under the model-X assumption, i.e., when the distribution of $X \mid Z$ is assumed to be known. Our methods are very different from the aforementioned papers because when $Z = \emptyset$, the model-X assumption reduces to assuming $P_X$ is known, which we of course avoid.
The current paper can be seen as extending the ideas from (Shekhar & Ramdas, 2021) to nonparametric independence testing.

2. Sequential Kernel Independence Test

We begin with a brief summary of the principle of testing by betting (Shafer, 2021; Shafer & Vovk, 2019). Suppose that one observes a sequence of random variables $(Z_t)_{t \ge 1}$, where $Z_t \in \mathcal{Z}$. A player begins with initial capital $\mathcal{K}_0 = 1$. At round $t$ of the game, she selects a payoff function $f_t: \mathcal{Z} \to [-1, \infty)$ that satisfies $\mathbb{E}_{Z \sim P_Z}[f_t(Z) \mid \mathcal{F}_{t-1}] = 0$ for all $P_Z \in H_0$, where $\mathcal{F}_{t-1} = \sigma(Z_1, \dots, Z_{t-1})$, and bets a fraction $\lambda_t \mathcal{K}_{t-1}$ of her wealth for an $\mathcal{F}_{t-1}$-measurable $\lambda_t \in [0, 1]$. Once $Z_t$ is revealed, her wealth is updated as
$$\mathcal{K}_t = \mathcal{K}_{t-1}\left(1 + \lambda_t f_t(Z_t)\right). \quad (4)$$
A level-$\alpha$ sequential test is obtained using the following stopping rule: $\Phi(Z_1, \dots, Z_t) = \mathbb{1}\{\mathcal{K}_t \ge 1/\alpha\}$, i.e., the null is rejected once the player's capital exceeds $1/\alpha$. If the null is true, the imposed constraints on the sequences of payoffs $(f_t)_{t \ge 1}$ and betting fractions $(\lambda_t)_{t \ge 1}$ prevent the player from making money. Formally, the wealth process $(\mathcal{K}_t)_{t \ge 0}$ is a nonnegative martingale. The validity of the resulting test then follows from Ville's inequality (Ville, 1939). To ensure that the resulting test has power under the alternative, payoffs and betting fractions have to be chosen carefully.

Inspired by the sequential two-sample tests of Shekhar & Ramdas (2021), our construction relies on dependence measures $m(P_{XY}; \mathcal{C})$ which admit a variational representation:
$$m(P_{XY}; \mathcal{C}) = \sup_{c \in \mathcal{C}} \left[\mathbb{E}_{P_{XY}}\, c(X, Y) - \mathbb{E}_{P_X \times P_Y}\, c(X, Y)\right], \quad (5)$$
for some class $\mathcal{C}$ of bounded functions $c: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. The supremum above is often achieved at some $c^\star \in \mathcal{C}$, and in this case, $c^\star$ is called the "witness function". In what follows, we use sufficiently rich functional classes $\mathcal{C}$ for which the following characteristic condition holds:
$$m(P_{XY}; \mathcal{C}) = 0 \ \text{ under } H_0, \qquad m(P_{XY}; \mathcal{C}) > 0 \ \text{ under } H_1, \quad (6)$$
for $H_0$ and $H_1$ defined in (1). To proceed, we bet on pairs of points from $P_{XY}$. Swapping the Y-components in a pair of points from $P_{XY}$, $Z_{2t-1}$ and $Z_{2t}$, gives two points from $P_X \times P_Y$: $\tilde{Z}_{2t-1} = (X_{2t-1}, Y_{2t})$ and $\tilde{Z}_{2t} = (X_{2t}, Y_{2t-1})$. We consider payoffs $f(Z_{2t-1}, Z_{2t})$ of the form:
$$s \left[\left(c(Z_{2t-1}) + c(Z_{2t})\right) - \left(c(\tilde{Z}_{2t-1}) + c(\tilde{Z}_{2t})\right)\right], \quad (7)$$
where the scaling factor $s > 0$ ensures that $f(z, z') \in [-1, 1]$ for any $z, z' \in \mathcal{X} \times \mathcal{Y}$. When the witness function $c^\star$ is used in the above, we denote the resulting function as the oracle payoff $f^\star$. Let the oracle wealth process $(\mathcal{K}^\star_t)_{t \ge 0}$ be defined by using $f^\star$ along with the betting fraction
$$\lambda^\star = \frac{\mathbb{E}\left[f^\star(Z_1, Z_2)\right]}{\mathbb{E}\left[f^\star(Z_1, Z_2)\right] + \mathbb{E}\left[(f^\star(Z_1, Z_2))^2\right]}. \quad (8)$$
We have the following result regarding the above quantities, whose proof is presented in Appendix B.2.2.

Theorem 2.1. Let $\mathcal{C}$ denote a class of functions $c: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ for measuring dependence as per (5).
1. Under $H_0$ in (1a) and (2a), any payoff $f$ of the form (7) satisfies $\mathbb{E}_{H_0}[f(Z_1, Z_2)] = 0$.
2. Suppose that $\mathcal{C}$ satisfies (6). Under $H_1$ in (1b), the oracle payoff $f^\star$ based on the witness function $c^\star$ satisfies $\mathbb{E}_{H_1}[f^\star(Z_1, Z_2)] > 0$. Further, for $\lambda^\star$ defined in (8), it holds that $\mathbb{E}_{H_1}[\log(1 + \lambda^\star f^\star(Z_1, Z_2))] > 0$. Hence, $\mathcal{K}^\star_t \overset{a.s.}{\to} +\infty$, which implies that the oracle test is consistent: $P_{H_1}(\tau^\star < \infty) = 1$, where $\tau^\star = \inf\{t \ge 1: \mathcal{K}^\star_t \ge 1/\alpha\}$.

Remark 2.2. While the betting fraction (8) suffices to guarantee the consistency of the corresponding test, the fastest growth rate of the wealth process is ensured by considering
$$\lambda^\star_K \in \arg\max_{\lambda \in (0,1)} \mathbb{E}\left[\log(1 + \lambda f^\star(Z_1, Z_2))\right].$$
Overshooting with the betting fraction may, however, result in the wealth tending to zero almost surely.

Example 2.
Consider a sequence (Wt)t 1, where ( 1, with probability 3/5, 1, with probability 2/5. In this case, we have λ K = 1/5 and E [log(1 + λ Wt)] > 0, implying that Kt a.s. + . On the other hand, it is easy to check that for λ = 2λ K we have: E[log(1 + λWt)] < 0. As a consequence, for the wealth process Kt corresponding to the pair (f , λ) it holds that Kt a.s. 0. To construct a practical test, we select an appropriate class C for which the condition (6) holds and replace the oracle f and λ with predictable estimates (ft)t 1 and (λt)t 1, meaning that those are computed using data observed prior to a given round of the game. We begin with a particular dependence measure, namely HSIC (Gretton et al., 2005a), and defer extensions to other measures to Section 3. HSIC-based Sequential Kernel Independence Test (SKIT). Let G be a separable RKHS2 with positive-definite kernel k( , ) and feature map φ( ) on X. Let H be a separable RKHS with positive-definite kernel l( , ) and feature map ψ( ) on Y. Assumption 2.3. Suppose that: (A1) Kernels k and l are nonnegative and bounded by one: supx X k(x, x) 1 and supy Y l(y, y) 1. 2Recall that an RKHS is a Hilbert space G of real-valued functions over X, for which the evaluation functional δx : G R, which maps g G to g(x), is a continuous map, and this fact must hold for every x X. Each RKHS is associated with a unique positive-definite kernel k : X X R, which can be viewed as a generalized inner product on X. We refer the reader to (Muandet et al., 2017) for an extensive recent survey of kernel methods. (A2) The product kernel k l : (X Y)2 R, defined as (k l)((x, y), (x , y )) := k(x, x )l(y, y ), is a characteristic kernel on the joint domain. Assumption (A1) is used to justify that the mean embeddings introduced later are well-defined elements of RKHSs, and the particular bounds are used to simplify the constants. Assumption (A2) is a sufficient condition for the characteristic condition (6) to hold (Fukumizu et al., 2007b), and we use it to argue about the consistency of our test. Under mild assumptions, it can be further relaxed to characteristic property of the kernels on the respective domains (Gretton, 2015). We note that the most common kernels on Rd: Gaussian (RBF) and Laplace, satisfy both (A1) and (A2). Define mean embeddings of the joint and marginal distributions: µXY := EPXY [φ(X) ψ(Y )] , µX := EPX [φ(X)] , µY := EPY [ψ(Y )] . (9) The cross-covariance operator CXY : H G associated with the joint measure PXY is defined as CXY := µXY µX µY , where is the outer product operation. This operator generalizes the covariance matrix. Hilbert-Schmidt independence criterion (HSIC) is a criterion defined as Hilbert-Schmidt norm, a generalization of Frobenius norm for matrices, of the cross-covariance operator (Gretton et al., 2005a): HSIC(PXY ; G, H) := CXY 2 HS . (10) HSIC is simply the squared kernel maximum mean discrepancy (MMD) between mean embeddings of PXY and PX PY in the product RKHS G H on X Y, defined by a product kernel k l. We can rewrite (10) as sup g G H g G H 1 EPXY [g(X, Y )] EPX PY [g(X , Y )] 2 , which matches the form (5). The witness function for HSIC admits a closed form (see Appendix D): g = µXY µX µY µXY µX µY G H , (11) where µXY , µX and µY are defined in (9). The oracle payoff based on HSIC: f (Z2t 1, Z2t), is given by g (Z2t 1) + g (Z2t) g ( Z2t 1) g ( Z2t) , (12) which has the form (7) with s = 1/2. 
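For concreteness, the following minimal NumPy sketch (ours, not the authors' implementation) shows how a plug-in version of the payoff (12) can be evaluated: the population mean embeddings in (11) are replaced by empirical averages over the previously observed pairs, the witness is normalized by its RKHS norm (the square root of the biased HSIC estimate), and the resulting witness is evaluated on the observed pair and on the Y-swapped pair. The RBF kernel and the bandwidth values are illustrative assumptions, and the norm is recomputed from scratch at every call purely for clarity.

```python
import numpy as np

def rbf(A, B, lam):
    """RBF kernel matrix exp(-lam * ||a_i - b_j||^2); A is (n, d), B is (m, d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-lam * d2)

def witness_eval(x, y, X_past, Y_past, lam_x, lam_y):
    """Unnormalized plug-in witness: inner product of (mu_XY_hat - mu_X_hat x mu_Y_hat)
    with phi(x) x psi(y), estimated from the 2(t-1) previously observed pairs."""
    kx = rbf(X_past, np.atleast_2d(x), lam_x).ravel()
    ly = rbf(Y_past, np.atleast_2d(y), lam_y).ravel()
    return (kx * ly).mean() - kx.mean() * ly.mean()

def witness_norm(X_past, Y_past, lam_x, lam_y):
    """RKHS norm of the unnormalized witness: square root of the biased
    HSIC estimate (1/n^2) tr(K H L H) on the past data."""
    n = len(X_past)
    K, L = rbf(X_past, X_past, lam_x), rbf(Y_past, Y_past, lam_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.sqrt(max(np.trace(K @ H @ L @ H), 0.0)) / n

def hsic_payoff(z_odd, z_even, X_past, Y_past, lam_x=0.25, lam_y=0.25):
    """Estimate of the payoff (12): evaluate the normalized witness on the observed
    pair (a draw from P_XY) and on the Y-swapped pair (a draw from P_X x P_Y)."""
    (x1, y1), (x2, y2) = z_odd, z_even
    nrm = witness_norm(X_past, Y_past, lam_x, lam_y)
    if nrm < 1e-12:
        return 0.0  # degenerate witness (e.g., too little data): abstain from betting
    g = lambda u, v: witness_eval(u, v, X_past, Y_past, lam_x, lam_y) / nrm
    return 0.5 * (g(x1, y1) + g(x2, y2) - g(x1, y2) - g(x2, y1))
```

Since the kernels are nonnegative and bounded by one, the returned payoff lies in [-1, 1], as required for the wealth update (4); the formal definition of the plug-in estimator used by SKIT follows next.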
To construct the test, we use estimators (ft)t 1 of the oracle payoff f obtained by replacing g in (12) with the plug-in estimator: ˆgt = ˆµXY ˆµX ˆµY ˆµXY ˆµX ˆµY G H , (13) Sequential Kernelized Independence Testing where ˆµXY , ˆµX, ˆµY denote the empirical mean embeddings (plug-in estimators of (9)) computed at round t as3 ˆµXY = 1 2(t 1) i=1 φ(Xi) ψ(Yi), ˆµX = 1 2(t 1) i=1 φ(Xi), ˆµY = 1 2(t 1) Note that in (13) the witness function is defined as an operator. We clarify this point in Appendix D. To select betting fractions, we follow Cutkosky & Orabona (2018) who state the problem of choosing the optimal betting fraction for coin betting as an online optimization problem with exp-concave losses and propose a strategy based on online Newton step (ONS) (Hazan et al., 2007) as a solution. ONS betting fractions are inexpensive to compute while being supported by strong theoretical guarantees. We also consider other strategies for selecting betting fractions and defer a detailed discussion to Appendix C. We conclude with formal guarantees on time-uniform type I error control and consistency of HSIC-based SKIT. In fact, we establish a stronger result: we show that the wealth process grows exponentially and characterize the rate of the growth of wealth in terms of the true HSIC score. The proof is deferred to Appendix B.2.2. Algorithm 1 Online Newton step (ONS) strategy for selecting betting fractions Input: sequence of payoffs (ft(Z2t 1, Z2t))t 1, λONS 1 = 0, a0 = 1. for t = 1, 2, . . . do Observe ft(Z2t 1, Z2t); Set zt = ft(Z2t 1, Z2t)/(1 λONS t ft(Z2t 1, Z2t)); Set at = at 1 + z2 t ; Set λONS t+1 := 1 2 0 λONS t 2 2 log 3 zt Theorem 2.4. Suppose that Assumption 2.3 is satisfied. The following claims hold for HSIC-based SKIT (Algorithm 2): 1. Suppose that H0 in (1a) or (2a) is true. Then SKIT ever stops with probability at most α: PH0 (τ < ) α. 2. Suppose that H1 in (1b) is true. Then it holds that Kta.s., + and thus SKIT is consistent: PH1(τ < ) = 1. Further, the wealth grows exponentially, and 3At round t, evaluating HSIC-based payoff requires a number of operations that is linear in t (see Appendix F.2). Thus after T steps, we have expended a total of O(T 2) computation, the same as batch HSIC. However, our test threshold is 1/α, but batch HSIC requires permutations to determine the right threshold, requiring recomputing HSIC hundreds of times. Thus, our test is actually more computationally feasible than batch HSIC. Algorithm 2 HSIC-based SKIT Input: significance level α (0, 1), data stream (Zi)i 1, where Zi = (Xi, Yi) PXY , λONS 1 = 0. for t = 1, 2, . . . do Use Z1, . . . , Z2(t 1) to compute ˆgt as in (13); Compute HSIC payoff ft(Z2t 1, Z2t); Update the wealth process Kt as in (4); if Kt 1/α then Reject H0 and stop; else Compute λONS t+1 (Algorithm 1); end if end for the corresponding growth rate satisfies lim inf t log Kt a.s. E[f (Z1,Z2)] 4 E[f (Z1,Z2)] E[(f (Z1,Z2))2] 1 , (15) where f is the oracle payoff defined in (12). Since E [f (Z1, Z2)] = p HSIC(PXY ; G, H) and E (f (Z1, Z2))2 1, Theorem 2.4 implies that: 1 t log Kt a.s. 1 4 HSIC(PXY ; G, H). However, the lower bound (15) is never worse. In particular, if the variance of the oracle payoffs: σ2 = V [f (Z1, Z2)], is small, meaning that σ2 E [f (Z1, Z2)] (1 E [f (Z1, Z2)]), we get a faster rate: p HSIC(PXY ; G, H)/4, reminiscent of an empirical Bernstein type adaptation. Up to some small constants, we show that this is the best possible exponent that adapts automatically between the lowand high-variance regimes. 
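To make the betting loop concrete, here is a short sketch (ours, with illustrative constants) of the wealth update (4) combined with an online-Newton-step update in the spirit of Algorithms 1 and 2: the betting fraction is kept in [0, 1/2], updated by a projected Newton-type step on the loss $-\log(1 + \lambda f_t)$, and the null is rejected once the wealth reaches $1/\alpha$. The step-size constant and clipping range follow the ONS betting literature and should be read as assumptions of this sketch rather than an exact reproduction of Algorithm 1.

```python
import numpy as np

def skit(payoff_stream, alpha=0.05):
    """Sketch of the SKIT loop: wealth update (4) with ONS-style betting fractions
    restricted to [0, 1/2], rejecting once the wealth reaches 1/alpha.
    `payoff_stream` yields the predictable payoffs f_t(Z_{2t-1}, Z_{2t}) in [-1, 1]."""
    wealth, lam, a = 1.0, 0.0, 1.0
    c = 2.0 / (2.0 - np.log(3.0))        # ONS step-size constant
    for t, f in enumerate(payoff_stream, start=1):
        wealth *= 1.0 + lam * f          # wealth update (4)
        if wealth >= 1.0 / alpha:
            return t, wealth             # reject H0 and stop
        # projected Newton-type step on the loss -log(1 + lam * f)
        grad = -f / (1.0 + lam * f)
        a += grad ** 2
        lam = float(np.clip(lam - c * grad / a, 0.0, 0.5))
    return None, wealth                  # stream exhausted: retain H0
```

Restricting $\lambda_t$ to [0, 1/2] keeps $1 + \lambda_t f_t$ bounded away from zero for payoffs in [-1, 1], so the wealth remains nonnegative throughout.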
We do this by considering the oracle test, i.e., assuming that the oracle HSIC payoff f is known. Amongst the betting fractions that are constrained to lie in [ 0.5, 0.5], like ONS bets, the optimal growth rate is ensured by taking λ = arg max λ [ 0.5,0.5] E [log(1 + λf (Z1, Z2))] . (16) We have the following result about the growth rate of the oracle test, whose proof is deferred to Appendix B.2.2. Proposition 2.5. The optimal log-wealth S := E [log(1 + λ f (Z1, Z2))] that can be achieved by an oracle betting scheme (16) which knows f from (12) and the underlying distribution satisfies: S E [f (Z1, Z2)] 8E [f (Z1, Z2)] 3E [(f (Z1, Z2))2] 1 . (17) Remark 2.6 (Minibatching). While our test processes the data stream in pairs, it is possible to use larger batches of Sequential Kernelized Independence Testing points from PXY . For a batch size is b 2, at round t, the bet is placed on (Xb(t 1)+1, Yb(t 1)+1), . . . , (Xbt, Ybt) . In this case, the empirical mean embeddings are computed analogous to (14) but using {(Xi, Yi)}i b(t 1). We defer the details to Appendix D. Such payoff function satisfies the necessary conditions for the wealth process to be a nonnegative martingale, and hence, the resulting sequential test has time-uniform type I error control. The same argument as in the proof of Theorem 2.4 can be used to show that the resulting test is consistent. The main downside of minibatching is that monitoring of the test (and hence, optional stopping) is allowed only after processing every b points from PXY . Distribution Drift. As discussed in Section 1, batch independence tests have an inflated false alarm rate even under mild changes in distribution. In contrast, SKIT remains valid even when the data distribution drifts over time. For a stream of independent points, we claimed that our test controls the type I error as long as only one of the marginal distributions changes at each round. In Appendix D, we provide an example that shows that this assumption is necessary for the validity of our tests. Our tests can also be used to test instantaneous independence between two streams. Formally, define Dt := {(Xi, Yi)}i 2t and consider: H0 : t, X2t 1 Y2t 1 | Dt 1 and X2t Y2t | Dt 1, (18a) H1 : t : X2t 1 Y2t 1 | Dt 1 or X2t Y2t | Dt 1. (18b) Assumption 2.7. Suppose that under H0 in (18a), it also holds that: (a) The cross-links between X and Y streams are not allowed, meaning that for all t 1, Yt Xt 1 | Yt 1, {(Xi, Yi)}i t 2, Xt Yt 1 | Xt 1, {(Xi, Yi)}i t 2. (19) (b) For all t 1, either (Xt, Xt 1) or (Yt, Yt 1) are exchangeable conditional on {(Xi, Yi)}i t 2. In the above, (a) relaxes the independence assumption within each pair, and (b) generalizes the assumption about allowed changes in the marginal distributions of X and Y . Under the above setting, we deduce that our test retains type1 error control, and the proof is deferred to Appendix B.2.2. Theorem 2.8. Suppose that H0 in (18a) is true. Further, assume that Assumption 2.7 holds. Then HSIC-based SKIT (Algorithm 2) satisfies: PH0 (τ < ) α. Chwialkowski & Gretton (2014) considered a related (at a high level) problem of testing instantaneous independence between a pair of time series. Similar to distribution drift, HSIC fails to test independence between innovations in time series since naively permuting one series destroys the underlying structure. 
Chwialkowski & Gretton (2014) used a subset of permutations rotations by circular shifting (allowed by their assumption of strict stationarity) of one series for preserving the structure to design a p-value and used the assumption of mixing (decreasing memory of a process) to justify the asymptotic validity. The setting we consider is very different, and we make no assumptions of mixing or stationarity anywhere. Related works on independence testing for time series also include (Chwialkowski et al., 2014; Besserve et al., 2013). In the next section, we extend the methodology to other dependence measures. 3. Alternative Dependence Measures Let C1 and C2 denote some classes of bounded functions c1 : X R and c2 : Y R respectively. For a class C of functions c : X Y R that factorize into the product: c(x, y) = c1(x)c2(y) for some c1 C1 and c2 C2, the general form of dependence measures (5) reduces to m(PXY ; C1, C2) = sup c1 C1,c2 C2 Cov (c1(X), c2(Y )) . Next, we develop SKITs based on two dependence measures of this form: COCO and KCC. While the corresponding witness functions do not admit a closed form, efficient algorithms for computing the plug-in estimates are available. Witness Functions for COCO. Constrained covariance (COCO) is a criterion for measuring dependence based on covariance between smooth functions of random variables: sup g,h: g G 1, h H 1 Cov (g(X), h(Y )) = sup g,h: g G 1, h H 1 g, CXY h G, (20) where the supremum is taken over the unit balls in the respective RKHSs (Gretton et al., 2005c;b). At round t, we are interested in empirical witness functions computed from {(Xi, Yi)}i 2(t 1). The key observation is that maximizing the objective function in (20) with the plug-in estimator of the cross-covariance operator requires considering only functions in G and H that lie in the span of the data: φ(Xi) 1 2(t 1) j=1 φ(Xj) , ψ(Yi) 1 2(t 1) j=1 ψ(Yj) . Coefficients α and β that solve the maximization problem (20) define the leading eigenvector of the following Sequential Kernelized Independence Testing generalized eigenvalue problem (see Appendix D): 0 1 2(t 1) K L 1 2(t 1) L K 0 = γ K 0 0 L where K = HKH, L = HLH, and H = I2(t 1) (1/(2(t 1))11 is centering projection matrix. Computing the leading eigenvector for (22) is computationally demanding for moderately large t. A common practice is to use lowrank approximations of K and L with fast-decaying spectrum (Jordan & Bach, 2001). We present an approach based on incomplete Cholesky decomposition in Appendix F.1. Witness Functions for KCC. Kernelized canonical correlation (KCC) relies on the regularized correlation between smooth functions of random variables: sup g G, h H Cov (g(X), h(Y )) q V [g(X)] + κ1 g 2 G q V [h(Y )] + κ2 h 2 H (23) where regularization is necessary for obtaining meaningful estimates of correlation (Jordan & Bach, 2001; Fukumizu et al., 2007a). Witness functions for KCC have the same form as for COCO (21), but α and β define the leading eigenvector of a modified problem (Appendix D). SKIT based on COCO or KCC. Given a pair of the witness functions g and h for COCO (or KCC) criterion, the corresponding oracle payoff: f (Z2t 1, Z2t), is given by 1 2 (g (X2t) g (X2t 1)) (h (Y2t) h (Y2t 1)) , (24) which has the form (7) with s = 1/2. To construct the test, we rely on estimates (ft)t 1 of the oracle payoff f obtained by using ˆgt and ˆht, defined in (21), in (24). We assume that α and β in (22) are normalized: α Kα = 1 and β Lβ = 1. 
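As an illustration of how the COCO witness coefficients can be obtained in practice, the following dense SciPy sketch (ours; the paper instead relies on incomplete Cholesky approximations, Appendix F.1) solves the generalized eigenvalue problem (22) directly on the past data. The small ridge added to the right-hand-side blocks is a numerical-stability assumption of this sketch.

```python
import numpy as np
from scipy.linalg import eigh

def coco_witness_coefficients(K, L, reg=1e-6):
    """Solve the generalized eigenvalue problem (22) for the COCO witness
    coefficients (alpha, beta). K and L are the raw kernel matrices on the
    X and Y samples observed so far; `reg` keeps the right-hand side
    positive definite for the dense solver."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering projection matrix
    Kc, Lc = H @ K @ H, H @ L @ H                # centered Gram matrices
    A = np.zeros((2 * n, 2 * n))
    A[:n, n:] = Kc @ Lc / n                      # off-diagonal covariance blocks
    A[n:, :n] = Lc @ Kc / n
    B = np.zeros((2 * n, 2 * n))
    B[:n, :n] = Kc + reg * np.eye(n)             # RKHS-norm constraint blocks
    B[n:, n:] = Lc + reg * np.eye(n)
    vals, vecs = eigh(A, B)                      # generalized symmetric eigenproblem
    top = vecs[:, np.argmax(vals)]               # leading eigenvector
    alpha, beta = top[:n], top[n:]
    # normalize so that alpha' Kc alpha = 1 and beta' Lc beta = 1
    alpha = alpha / np.sqrt(max(alpha @ Kc @ alpha, 1e-12))
    beta = beta / np.sqrt(max(beta @ Lc @ beta, 1e-12))
    return alpha, beta
```

The dense solve is cubic in the number of observed pairs and is only practical for small samples, which is precisely why the low-rank route is preferred in the main algorithm.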
We conclude with a guarantee on timeuniform false alarm rate control of SKITs based on COCO (Algorithm 3), whose proof is deferred to Appendix B.3. Theorem 3.1. Suppose that (A1) in Assumption 2.3 is satisfied. Then, under H0 in (1a) and (18a), COCO/KCC-based SKIT (Algorithm 3) satisfies: PH0 (τ < ) α. Remark 3.2. The above result does not contain a claim regarding the consistency of the corresponding tests. If (A2) in Assumption 2.3 holds, the same argument as in the proof of Theorem 2.4 can be used to deduce that SKITs based on the oracle payoffs (with oracle witness functions g and h ) are consistent. In contrast to HSIC, for which the oracle witness function is closed-form and the respective plug-in estimator is amenable for the analysis, to argue about the consistency of SKITs based on COCO/KCC, it is necessary to place additional assumptions, especially since low-rank approximations of kernel matrices are involved. We note that a suf- Algorithm 3 SKIT based on COCO (or KCC) Input: significance level α (0, 1), data stream (Zi)i 1, where Zi = (Xi, Yi) PXY , λONS 1 = 0. for t = 1, 2, . . . do Use Z1, . . . , Z2(t 1) to compute ˆgt and ˆht as in (21); Compute COCO payoff ft(Z2t 1, Z2t); Update the wealth process Kt as in (4); if Kt 1/α then Reject H0 and stop; else Compute λONS t+1 (Algorithm 1); end if end for ficient condition for consistency is that the payoffs are positive on average: lim inft 1 t Pt i=1 fi(Z2i 1, Z2i) a.s. > 0. Synthetic Experiments. To compare SKITs based on HSIC, COCO, and KCC payoffs, we use RBF kernel with hyperparameters taken to be inversely proportional to the second moment of the underlying variables; we observed no substantial difference when such selection is data-driven (median heuristic). We consider settings where the complexity of a task is controlled through a single univariate parameter: (a) Gaussian model. For t 1, we consider Yt = Xtβ + εt, where Xt, εt N(0, 1). We have that β = 0 implies nonzero linear correlation (hence dependence). We consider 20 values for β, spaced uniformly in [0,0.3], and use λX = 1/4 and λY = 1/(4(1+β2)) as kernel hyperparameters. (b) Spherical model. We generate a sequence of dependent but linearly uncorrelated random variables by taking (Xt, Yt) = ((Ut)(1), (Ut)(2)), where Ut iid Unif(Sd), for t 1. Sd denotes a unit sphere in Rd and u(i) is the i-th coordinate of u. We consider d {3, . . . , 15}, and use λX = λY = d/4 as kernel hyperparameters. We stop monitoring after observing 20000 points from PXY (if SKIT does not stop by that time, we retain the null) and aggregate the results over 200 runs for each value of β and d. In Figure 3, we confirm that SKITs control the type I error and adapt to the complexity of a task. In settings with a very low signal-to-noise ratio (small β or large d), SKIT s power drops, but in such cases, both sequential and batch independence tests inevitably require a lot of data to reject the null. We defer additional experiments to Appendix E.4. 4. Symmetry-based Betting Strategies In this section, we develop a betting strategy that relies on symmetry properties, whose advantage is that it overcomes the kernel boundedness assumption that underlined Sequential Kernelized Independence Testing 0.00 0.05 0.10 0.15 0.20 0.25 0.30 β ( ) Rejection rate (- -) Sample budget used to reject the null HSIC COCO KCC α = 0.05 (a) Gaussian model. 4 6 8 10 12 14 d ( ) Rejection rate (- -) Sample budget used to reject the null HSIC COCO KCC α = 0.05 (b) Spherical model. Figure 3. 
Rejection rate and scaled sample size used to reject the null hypothesis for synthetic data. Inspecting the rejection rate for β = 0 (independence holds) confirms that the type I error is controlled. Further, we confirm that SKITs are adaptive to the complexity (smaller β and larger d correspond to harder settings). the SKIT construction. For example, using this betting strategy with a linear kernel: k(x, y) = l(x, y) = x, y readily implies a valid sequential linear correlation test. Consider Wt = ˆgt(Z2t 1) + ˆgt(Z2t) ˆgt( Z2t 1) ˆgt( Z2t), (25) where ˆgt = ˆµXY ˆµX ˆµY is the unnormalized plug-in witness function computed from {Zi}i 2(t 1). Symmetrybased betting strategies rely on the following key fact. Proposition 4.1. Under any distribution in H0, Wt is symmetric around zero, conditional on Ft 1. By construction, we expect the sign and magnitude of Wt to be positively correlated under the alternative. We consider three payoff functions that aim to exploit this fact. 1. Composition with an odd function. This approach is based on the idea from sequential symmetry testing (Ramdas et al., 2020) that composition with an odd function of a symmetric around zero random variable is mean-zero. Absent knowledge regarding the scale of considered random variables, it is natural to standardize {Wi}i 1 in a predictable way. We consider f odd t (Wt) = tanh (Wt/Nt 1), (26) where Nt = Q0.9({|Wi|}i t) Q0.1({|Wi|}i t), and Qα({|Wi|}i t) is the α-quantile of the empirical distribution of the absolute values of {Wi}i t. (The choices of 0.1 and 0.9 are heuristic, and can be replaced by other constants without violating the validity of the test.) The composition approach has demonstrated promising empirical performance for the betting-based two-sample tests of Shekhar & Ramdas (2021) and conditional independence tests of Shaer et al. (2023). 2. Rank-based approach. Inspired by sequential signedrank test of symmetry around zero of Reynolds Jr. (1975), we consider the following payoff function: f rank t (Wt) = sign(Wt) rk(|Wt|) where rk(|Wt|) = Pt i=1 1 {|Wi| |Wt|}. 3. Predictive approach. At round t, we fit a probabilistic predictor pt : R+ [0, 1], e.g., logistic regression, using {|Wi| , sign (Wi)}i t 1 as feature-label pairs. We consider the following payoff function: f pred t (Wt) = (2pt(|Wt|) 1)+ (1 2ℓt(Wt)) , (28) where ( )+ = max { , 0} and ℓt(|Wt| , sign (Wt)) is the misclassification loss of the predictor pt. In the next result, whose proof is deferred to Appendix B.4, we show that symmetry-based SKITs are valid. Algorithm 4 SKIT with symmetry-based betting Input: significance level α (0, 1), data stream (Zi)i 1, where Zi = (Xi, Yi) PXY , λONS 1 = 0. for t = 1, 2, . . . do Observe Z2t 1, Z2t and compute Wt as in (25); Compute payoff f odd t (Wt) as in (26); Update the wealth process Kt as in (4); if Kt 1/α then Reject H0 and stop; else Compute λONS t+1 (Algorithm 1); end if end for Theorem 4.2. Under H0 in (1a) and (18a), the symmetrybased SKIT (Algorithm 4) satisfies: PH0 (τ < ) α. Sequential Kernelized Independence Testing Synthetic Experiments. To compare the symmetry-based payoffs, we consider the Gaussian model along with a GRAPA betting fractions. For visualization purposes, we complete monitoring after observing 2000 points from the joint distribution. In Figure 4(a), we observe that the resulting SKITs demonstrate similar performance. 
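The symmetry-based payoffs above are also simple to implement. As a concrete illustration, the sketch below (ours, not the authors' code) computes the tanh-composition payoff (26) and a normalized variant of the rank-based payoff (27) from the running history of $|W_i|$, with $W_t$ computed as in (25). The normalization of the rank by $t$ and the zero payoff returned before a scale estimate is available are assumptions of this sketch; a zero payoff is a fair bet that simply leaves the wealth unchanged.

```python
import numpy as np

def odd_payoff(W_t, past_abs_W):
    """Composition with an odd function, as in (26): tanh of W_t scaled by a
    predictable spread estimate N_{t-1}, the 0.9-0.1 interquantile range of
    the past |W_i|. Returns a value in (-1, 1)."""
    if len(past_abs_W) == 0:
        return 0.0                                   # no scale estimate yet: abstain
    n = np.quantile(past_abs_W, 0.9) - np.quantile(past_abs_W, 0.1)
    return float(np.tanh(W_t / n)) if n > 0 else 0.0

def rank_payoff(W_t, past_abs_W):
    """Rank-based payoff in the spirit of (27): the sign of W_t times the rank of
    |W_t| among all |W_i| seen so far, normalized to [-1, 1] (the normalization
    is an assumption of this sketch)."""
    all_abs = np.append(past_abs_W, abs(W_t))
    rk = np.sum(all_abs <= abs(W_t))
    return float(np.sign(W_t)) * rk / len(all_abs)
```

Both payoffs are odd functions of $W_t$ given predictable scaling quantities, so they are conditionally mean-zero under the null by Proposition 4.1 and take values in [-1, 1].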
In Figure 4(b), we demonstrate that SKIT with a linear kernel has high power under the Gaussian model, whereas its false alarm rate does not exceed α under the spherical model. Additional synthetic experiments can be found in Appendix E.3. 0.0 0.2 0.4 0.6 0.8 1.0 β ( ) Rejection rate (- -) Sample budget used to reject the null f pred t f rank t f odd t α = 0.05 0.00 0.05 0.10 0.15 0.20 0.25 0.30 β ( ) Rejection rate (- -) Sample budget used to reject the null 2 3 4 5 6 7 8 9 10 d Spherical model Gaussian model α = 0.05 Figure 4. (a) SKITs with symmetry-based payoffs have high power under the Gaussian model. (b) SKIT with linear kernel has high power under the Gaussian model (X and Y are linearly correlated for β = 0), and its false alarm rate is controlled under the spherical model (X and Y are linearly uncorrelated but dependent). Real Data Experiments. We analyze average daily temperatures4 in four European cities: London, Amsterdam, Zurich, and Nice, from January 1, 2017, to May 31, 2022. The processes underlying temperature formation are com- 4data source: https://www.wunderground.com plex and act both on macro (e.g., solar phase) and micro (e.g., local winds) levels. While average daily temperatures in selected cities share similar cyclic patterns, one may still expect the temperature fluctuations occurring in nearby locations to be dependent. We use SKIT for testing instantaneous independence (as per (18)) between fluctuations (assuming that the conditions that underlie our test hold). We run SKIT with the rank-based payoff and ONS betting fractions for each pair of cities using 6/α as a rejection threshold (accounting for multiple testing). We select the kernel hyperparameters via the median heuristic using recordings for the first 20 days. In Figures 5, we illustrate that SKIT supports our conjecture that temperature fluctuations are dependent in nearby locations. We also run this experiment for four cities in South Africa (see Appendix E.5). In addition, we analyze the performance of SKIT on MNIST data; the details are deferred to Appendix E.6. 10 5 0 5 10 Longitude London (LDN) Amsterdam (AMS) Zurich (ZRH) Figure 5. Solid lines connect cities for which the null is rejected. SKIT supports the conjecture regarding dependent temperature fluctuations in nearby locations. 5. Conclusion A key advantage of sequential tests is that they can be continuously monitored, allowing an analyst to adaptively decide whether to stop and reject the null hypothesis or to continue collecting data, without inflating the false positive rate. In this paper, we design consistent sequential kernel independence tests (SKITs) following the principle of testing by betting. SKITs are also valid beyond the i.i.d. setting, allowing the data distribution to drift over time. Experiments on synthetic and real data confirm the power of SKITs. Acknowledgements. The authors thank Ian Waudby-Smith, Tudor Manole and the anonymous reviewers for constructive feedback. The authors also acknowledge Lucas Janson and Will Hartog for thoughtful questions and comments at the International Seminar for Selective Inference. Sequential Kernelized Independence Testing Balsubramani, A. and Ramdas, A. Sequential nonparametric testing with the law of the iterated logarithm. In Conference on Uncertainty in Artificial Intelligence, 2016. Besserve, M., Logothetis, N. K., and Sch olkopf, B. Statistical analysis of coupled time series with kernel crossspectral density operators. In Advances in Neural Information Processing Systems, 2013. 
Breiman, L. Optimal gambling systems for favorable games. Berkeley Symposium on Mathematical Statistics and Probability, 1962. Chwialkowski, K. and Gretton, A. A kernel independence test for random processes. In International Conference on Machine Learning, 2014. Chwialkowski, K. P., Sejdinovic, D., and Gretton, A. A wild bootstrap for degenerate kernel tests. In Advances in Neural Information Processing Systems, 2014. Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in banach spaces. In Conference on Learning Theory, 2018. Darling, D. A. and Robbins, H. E. Some nonparametric sequential tests with power one. Proceedings of the National Academy of Sciences, 1968. Ernst, P. A., Shepp, L. A., and Wyner, A. J. Yule s nonsense correlation solved! The Annals of Statistics, 2017. Fan, X., Grama, I., and Liu, Q. Exponential inequalities for martingales with applications. Electronic Journal of Probability, 2015. Fukumizu, K., Bach, F. R., and Gretton, A. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 2007a. Fukumizu, K., Gretton, A., Sun, X., and Sch olkopf, B. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, 2007b. Gretton, A. A simpler condition for consistency of a kernel independence test. ar Xiv preprint: 1501.06103, 2015. Gretton, A., Bousquet, O., Smola, A., and Sch olkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, 2005a. Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Sch olkopf, B. Kernel methods for measuring independence. Journal of Machine Learning Research, 2005b. Gretton, A., Smola, A., Bousquet, O., Herbrich, R., Belitski, A., Augath, M., Murayama, Y., Pauls, J., Sch olkopf, B., and Logothetis, N. Kernel constrained covariance for dependence measurement. In Tenth International Workshop on Artificial Intelligence and Statistics, 2005c. Gr unwald, P., Henzi, A., and Lardy, T. Anytime-valid tests of conditional independence under model-X. Journal of the American Statistical Association, 2023. Gr unwald, P., de Heide, R., and Koolen, W. M. Safe testing. In Information Theory and Applications Workshop, 2020. Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 2007. Hoeffding, W. The strong law of large numbers for U statistics. Technical report, University of North Carolina, 1961. Jordan, M. I. and Bach, F. R. Kernel independent component analysis. Journal of Machine Learning Research, 2001. Kelly, J. L. A new interpretation of information rate. The Bell System Technical Journal, 1956. Le Cun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 1998. Muandet, K., Fukumizu, K., Sriperumbudur, B., and Sch olkopf, B. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 2017. Ramdas, A., Ruf, J., Larsson, M., and Koolen, W. Admissible anytime-valid sequential inference must rely on nonnegative martingales. ar Xiv preprint: 2009.03167, 2020. Ramdas, A., Ruf, J., Larsson, M., and Koolen, W. Testing exchangeability: Fork-convexity, supermartingales and e-processes. International Journal of Approximate Reasoning, 2022. Reynolds Jr., M. R. A sequential signed-rank test for symmetry. The Annals of Statistics, 1975. Shaer, S., Maman, G., and Romano, Y. 
Model-free sequential testing for conditional independence via testing by betting. In International Conference on Artificial Intelligence and Statistics, 2023. Shafer, G. Testing by betting: a strategy for statistical and scientific communication. Journal of the Royal Statistical Society: Series A (Statistics in Society), 2021. Sequential Kernelized Independence Testing Shafer, G. and Vovk, V. Game-Theoretic Foundations for Probability and Finance. Wiley Series in Probability and Statistics. Wiley, 2019. Shekhar, S. and Ramdas, A. Nonparametric two-sample testing by betting. ar Xiv preprint: 2112.09162, 2021. Ville, J. Etude critique de la notion de collectif. Th eses de l entre-deux-guerres. Gauthier-Villars, 1939. Waudby-Smith, I. and Ramdas, A. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2023. Yule, G. U. Why do we sometimes get nonsense-correlations between time-series? a study in sampling and the nature of time-series. Journal of the Royal Statistical Society, 1926. Sequential Kernelized Independence Testing A. Independence Testing for Streaming Data In Section A.1, we describe a permutation-based approach for conducting batch HSIC and show that continuous monitoring of batch HSIC (without corrections for multiple testing) leads to an inflated false alarm rate. In Section A.2, we introduce the sequential two-sample testing (2ST) problem and describe a reduction of sequential independence testing to sequential 2ST. In Section A.3, we compare our test to HSIC in the batch setting. A.1. Failure of Batch HSIC under Continuous Monitoring To conduct independence testing using batch HSIC, we use permutation p-value (with M = 1000 random permutations): P = 1 M+1(1 + PM m=1 1 {Tm T}), where Tm is the value of HS-norm computed from the m-th permutation and T is HS-norm value on the original data. In other words, suppose that we are given a sample Z1, . . . , Zt, where Zi = (Xi, Yi). Let St denote the set of all permutations of t indices and let σ Unif(St) be a random permutation of indices. Then: (X1, Y1), . . . , (Xt, Yt) = T = \ HSICb ((X1, Y1), . . . , (Xt, Yt)) (X1, Yσm(1)), . . . , (Xt, Yσm(t)), = Tm = \ HSICb (X1, Yσm(1)), . . . , (Xt, Yσm(t)) , m {1, . . . , M} , where we use a biased estimator of HSIC: \ HSICb = 1 i,j Kij Lij + 1 i,j,q,r Kij Lqr 2 i,j,q Kij Liq = 1 t2 tr (KHLH) . For brevity, we use Kij = k(Xi, Xj), Lij = l(Yi, Yj) for i, j {1, . . . , t}. Next, we study batch HSIC under (a) fixed-time and (b) continuous monitoring. We consider a simple case when X and Y are independent standard Gaussian random variables. We consider (re)conducting a test at 12 different sample sizes: t {50, 100, . . . , 600}: (a) Under fixed-time monitoring, for each value of t, we sample a sequence Z1, . . . , Zt (100 times for each t) and conduct batch-HSIC test. The goal is to confirm that batch-HSIC controls type I error by tracking the standard miscoverage rate. (b) Under continuous monitoring, we sample new datapoints and re-conduct the test. We illustrate inflated type I error by tracking the cumulative miscoverage rate, that is, the fraction of times, the test falsely rejects the independence null. The results are presented in Figure 6. For Bonferroni correction, we decompose the error budget as: α = P i=1 α i(i+1), that is, for t-th test we use threshold αt = α/(t(t + 1)) for testing. A.2. 
Sequential Independence Testing via Sequential Two-Sample Testing First, we introduce the sequential two-sample testing problem. Suppose that we observe a stream of data: ( X1, Y1), ( X2, Y2), . . . , where ( Xt, Yt) iid P Q. Two-sample testing refers to testing: H0 : ( Xt, Yt) iid P Q and P = Q, vs. H1 : ( Xt, Yt) iid P Q and P = Q. In Figure 1, we compared our test against the approach based on the reduction of independence testing to two-sample testing. We used the sequential two-sample kernel MMD test of Shekhar & Ramdas (2021) with the product kernel K (that is, a product of Gaussian kernels) and the same set of hyperparameters as for our test for a fair comparison. To reduce sequential independence testing to any off-the-shelf sequential two-sample testing procedure, we convert the original sequence of points from PXY to a sequence of i.i.d. ( Xt, Yt)-pairs, where Xt PXY and Yt PX PY respectively; see Figure 7(a). At t-th round, we randomly choose one point as Xt, e.g., (X1, Y1) for the first triple. Next, we obtain Yt by randomly matching X Sequential Kernelized Independence Testing 100 200 300 400 500 600 700 Sample size Rejection rate CM: Batch HSIC CM: Batch HSIC + Bonferroni FTM: Batch HSIC α = 0.05 Figure 6. Inflated false alarm rate of batch HSIC under continuous monitoring (CM, red line with squares) for the case when X and Y are independent standard Gaussian random variables. Bonferroni correction (CM, blue line with triangles) restores type I error control. As expected, type I error is controlled at a specified level under fixed-time monitoring (FTM, green line with circles). and Y from two other pairs, e.g., (X2, Y3) or (X3, Y2) for the first triple. In fact, the betting-based sequential two-sample test of (Shekhar & Ramdas, 2021) allows removing the effect of randomization (i.e., throwing away one observation in each triple), by averaging payoffs evaluated on ( Xt, Y (1) t ) and ( Xt, Y (2) t ). Other approaches which do not require throwing data away are also available (Figures 7(b)) but those only yield an i.i.d. sequence only under the null. X1 Y1 X2 Y2 X3 Y3 X4 Y4 X5 Y5 X6 Y6 X1 X2 Y (2) 1 Y (2) 2 Y (1) 1 Y (1) 2 X1 Y1 X2 Y2 X3 Y3 X4 Y4 X5 Y5 X6 Y6 X1 X2 X3 X4 Y1 Y2 Y4 Y2 X1 Y1 X2 Y2 X3 Y3 X4 Y4 X5 X1 X2 X3 X4 Y1 Y2 Y3 Y4 Figure 7. Reducing sequential independence testing to sequential two-sample testing. Processing as per (a) results in a sequence of i.i.d. observations both under the null and under the alternative (making the results about power valid). Processing data as per (b) gives an i.i.d. sequence only under the null. Reduction (b) is very similar to reduction (c). However, the latter makes Xi, i 2, dependent on the past, and thus can not be used directly for considered sequential two-sample tests. Additional Details of the Simulation Presented in Figure 1. We consider the Gaussian model: Yt = Xtβ + εt, where Xt, εt N(0, 1), t 1. We consider 10 values of β: β {0, 0.04, . . . , 0.36}, and for each β we repeat the simulation 100 Sequential Kernelized Independence Testing times. In this simulation, we compare three approaches for testing independence (valid under continuous monitoring): 1. HSIC-based SKIT proposed in this work (Algorithm 2); 2. Batch HSIC adapted to continuous monitoring via Bonferroni correction. We allow monitoring after processing every n, n {10, 100}, new points from PXY , that is, the permutation p-value (computed over 2500 randomly sampled permutations) is compared against rejection thresholds: αn = α/(n(n + 1)), n = 1, 2, . . . 3. 
Sequential independence testing via reduction sequential 2ST as described above. We use RBF kernel with the same set of kernel hyperparameters for all testing procedures: λX = 1/4, λY = 1/(4(1 + β2)). A.3. Comparison in the Batch Setting Sequential tests are complementary to batch tests and are not intended to replace them, and hence comparing the two on equal footing is hard. To highlight this, consider two simple scenarios. If we have 2000 data points, and HSIC fails to reject, there is not much we can do to rescue the situation. But if SKIT fails to reject, an analyst can collect more data and continue testing, retaining type I error control. In contrast, with 2 million points, HSIC will take forever to run, especially due to permutations. But if the alternative is true and the signal is strong, then SKIT may reject within 200 samples and stop. In short, the ability of SKIT to continue collecting and analyzing data is helpful for hard problems, and the ability of SKIT to stop early is helpful for easy problems. There is no easy sense in which one can compare them apples to apples and there is no sense in which batch HSIC uniformly dominates SKIT or vice versa. In a real setting, if an analyst has a strong hunch that the null is false and has the ability to collect data and run HSIC, the question is how much data should be collected? The answer depends on the underlying data distribution, which is of course unknown. With SKIT, data can be collected and analyzed sequentially. Theorem 2.4 implies that SKIT will stop early on easy problems and later on harder problems, all without knowing anything about the problem in advance. If however, one has a fixed batch of data, no chance to collect more, and no computational constraints, then running HSIC makes more sense. To illustrate that batch HSIC can be superior to SKIT, we compare tests on a dataset with a prespecified sample size (500 observations from the Gaussian model) and track the empirical rejection rates of two tests. In Figure 8, we show that HSIC actually has higher power than SKIT. However, for β = 0.1 (where all tests have low power), Figure 3(a) shows that collecting just a bit more data (which is allowed) is needed for SKIT to reach perfect power. We also added a third method (D-SKIT) which removes the effect of the ordering of random variables under the assumptions that {(Xi, Yi)}n i=1 are independent draws from PXY . Let {σb}B b=1 define B random permutations of n indices, and let Kb n denote the wealth after betting on a sequence ordered according to σb. For each b, Kb n has expectation at most one, and hence (by linearity of expectation and Markov s inequality) 1 n 1 B PB i=1 Kb n 1/α o is a valid level-α batch test. This test is a bit more stable: it improves SKIT s power on moderate-complexity setups at the cost of a slight power loss on more extreme ones. 0.0 0.2 0.4 0.6 0.8 1.0 β Rejection rate D-SKIT SKIT Batch HSIC α = 0.05 Figure 8. Comparison of SKIT and HSIC under Gaussian model in the batch setting. Non-surprisingly, batch HSIC performs best. D-SKIT improves over SKIT s power on moderate-complexity setups at the cost of a slight power loss on more extreme ones. Sequential Kernelized Independence Testing Section B.1 contains auxiliary results needed to prove the results presented in this paper. In Section B.2, we prove the results from Section 2. In Secton B.3, we prove the results from Section 3. B.1. Auxiliary Results Theorem B.1 (Ville s inequality (Ville, 1939)). 
Suppose that (Mt)t 0 is a nonnegative supermartingale process adapted to a filtration {Ft : t 0}. Then, for any a > 0 it holds that: P ( t 1 : Mt a) E [M0] Theorem B.2 (Theorem 3 in (Gretton et al., 2005a)). Assume that k and l are bounded almost everywhere by 1, and are nonnegative. Then for n > 1 and any δ (0, 1), it holds with probability at least 1 δ that: HSIC(PXY ; G, H) \ HSICb(PXY ; G, H) where α2 > 0.24 and C are some absolute constants. B.2. Proofs for Section 2 In Section B.2.1, we prove several intermediate results. The proofs of the main results are deferred to Section B.2.2. B.2.1. SUPPORTING LEMMAS Before we state the first result, recall the definition of the empirical mean embeddings computed from the first 2(t 1) datapoints: ˆµ(t) XY = 1 2(t 1) i=1 φ(Xi) ψ(Yi), ˆµ(t) X = 1 2(t 1) i=1 φ(Xi), ˆµ(t) Y = 1 2(t 1) where we highlight the dependence on the number of processed datapoints. We have the following result. Lemma B.3. For the empirical (29) and population (9) mean embeddings, it holds that: ˆµ(t) XY ˆµ(t) X ˆµ(t) Y G H a.s. µXY µX µY G H . (30) Proof. We have µXY µX µY 2 G H = HSIC(PXY ; G, H), ˆµ(t) XY ˆµ(t) X ˆµ(t) Y 2 G H = \ HSIC (t) b (PXY ; G, H), where the latter is a biased estimator of HSIC, computed from 2(t 1) datapoints from PXY . From Theorem B.2 and the Borel-Cantelli lemma, it follows that: ˆµ(t) XY ˆµ(t) X ˆµ(t) Y 2 a.s. µXY µX µY 2 G H . The result then follows from the continuous mapping theorem. Lemma B.4. Suppose that H1 in (1b) is true. Then for the oracle (11) and plug-in (13) witness functions, it holds that: ˆgt, g G H a.s. 1. (31) As a consequence, ˆgt g G H a.s. 0. Sequential Kernelized Independence Testing Proof. Suppose that the alternative in (1b) happens to be true. Then since k and l are characteristic kernels, it follows that: µXY µX µY G H > 0. We aim to show that: * ˆµ(t) XY ˆµ(t) X ˆµ(t) Y ˆµ(t) XY ˆµ(t) X ˆµ(t) Y G H , µXY µX µY µXY µX µY G H From Lemma B.3, we know that: ˆµ(t) XY ˆµ(t) X ˆµ(t) Y G H a.s. µXY µX µY G H. Hence it suffices to show that D ˆµ(t) XY ˆµ(t) X ˆµ(t) Y , µXY µX µY E a.s. µXY µX µY 2 G H . (32) Recall that: µXY µX µY = E h φ( X) ψ( Y ) i E h φ( X) i E h ψ( Y ) i . We have: ˆµ(t) XY ˆµ(t) X ˆµ(t) Y = 1 1 2(t 1) i=1 φ(Xi) ψ(Yi) 1 4(t 1)2 2(t 1) j,k=1: j =k φ(Xj) ψ(Yk) Further, it holds that: D ˆµ(t) XY ˆµ(t) X ˆµ(t) Y , µXY µX µY E = 1 1 2(t 1) i=1 E X, Y h φ( X), φ(Xi) G ψ( Y ), ψ(Yi) H i 1 4(t 1)2 2(t 1) j,k=1: j =k E X h φ( X), φ(Xj) G i E Y h ψ( Y ), ψ(Yk) H i For any (x, y) X Y, we have: E X, Y h φ( X), φ(x) G ψ( Y ), ψ(y) H i E X, Y h φ( X), φ(x) G ψ( Y ), ψ(y) H i k( X, X)k(x, x)l( Y , Y )k(y, y) and similarly, for any (x, y) X Y it holds that: E X h φ( X), φ(x) G i E Y h ψ( Y ), ψ(y) H i 1. Hence, by the SLLN, it follows that ((X, Y ), ( X, Y ) iid PXY ): i=1 E X, Y h φ( X), φ(Xi) G ψ( Y ), ψ(Yi) H i a.s. EX,Y, X, Y h φ( X), φ(X) G ψ( Y ), ψ(Y ) H i = µXY , µXY G H . Similarly, by the SLLN for U-statistics with bounded kernel (Hoeffding, 1961), it follows that: 1 4(t 1)2 2(t 1) j,k=1: j =k E X h φ( X), φ(Xj) G i E Y h ψ( Y ), ψ(Yk) H i a.s. EX, X h φ( X), φ(X) G i EY, Y h ψ( Y ), ψ(Y ) H i = µX µY , µX µY G H . Sequential Kernelized Independence Testing Hence, we deduce that: D ˆµ(t) XY ˆµ(t) X ˆµ(t) Y , µXY µX µY E a.s. µXY , µXY G H µX µY , µX µY G H = µXY µX µY , µXY µX µY G H = µXY µX µY 2 G H . Recalling (32), the proof of (31) is complete. To establish the consequence, simply note that: ˆgt g G H = q 2 1 ˆgt, g G H , and hence the result follows. Lemma B.5. 
Suppose that (xt)t 1 is a sequence of numbers such that limt xt = x. Then the corresponding sequence of partial averages also converges to x, that is, limn 1 n Pn t=1 xt = x. This also implies that if (Xt)t 1 is a sequence of random variables such that Xt a.s. X, then (Pn t=1 Xt)/n a.s. X. Proof. Fix any ε > 0. Since (xt)t 1 is converging, then M > 0: |xt x| M, t 1. Now, let n0 be such that |xt x| ε/2 for all n > n0. Further, choose any n1 > n0: Mn0/n1 ε/2. Hence, for any n > n1, it holds that: 1 n t=n0+1 xt x t=1 |xt x| + 1 t=n0+1 |xt x| which implies the result. Before we state the next result, recall that HSIC-based payoffs are based on the predictable estimates {ˆgi}i 1 of the oracle witness function g and have the following form: fi(Z2i 1, Z2i) = 1 2 (ˆgi(Z2i 1) + ˆgi(Z2i)) 1 ˆgi( Z2i 1) + ˆgi( Z2i) , i 1. f (Z2i 1, Z2i) = 1 2 (g (Z2i 1) + g (Z2i)) 1 g ( Z2i 1) + g ( Z2i) . (33) Lemma B.6. Suppose that H1 in (1b) is true. Then it holds that: i=1 fi(Z2i 1, Z2i) a.s. E [f (Z1, Z2)] , (34) i=1 (fi(Z2i 1, Z2i))2 a.s. E (f (Z1, Z2))2 . (35) Proof. We start by proving (34). Note that: fi(Z2i 1, Z2i) = 1 2 (ˆgi(Z2i 1) + ˆgt(Z2i)) 1 ˆgi( Z2i 1) + ˆgi( Z2i) 2 ˆgi, (φ(X2i) φ(X2i 1)) (ψ(Y2i) ψ(Y2i 1)) G H . Sequential Kernelized Independence Testing Next, observe that: 1 i=1 fi(Z2i 1, Z2i) E [f (Z1, Z2)] i=1 fi(Z2i 1, Z2i) 1 i=1 f (Z2i 1, Z2i) i=1 f (Z2i 1, Z2i) E [f (Z1, Z2)] | {z } a.s. 0 where the second term converges almost surely to 0 by the SLLN. For the first term, we have that: 1 i=1 fi(Z2i 1, Z2i) 1 i=1 f (Z2i 1, Z2i) i=1 |fi(Z2i 1, Z2i) f (Z2i 1, Z2i)| . Finally, note that: |fi(Z2i 1, Z2i) f (Z2i 1, Z2i)| = 1 ˆgi g , (φ(X2i) φ(X2i 1)) (ψ(Y2i) ψ(Y2i 1)) G H ˆgi g G H a.s. 0, (36) where the convergence result is due to Lemma B.4. The result (34) then follows after invoking Lemma B.5. Next, we prove (35). Note that: i=1 (fi(Z2i 1, Z2i))2 = 1 i=1 (fi(Z2i 1, Z2i) f (Z2i 1, Z2i) + f (Z2i 1, Z2i))2 i=1 (fi(Z2i 1, Z2i) f (Z2i 1, Z2i))2 | {z } a.s. 0 i=1 (f (Z2i 1, Z2i))(fi(Z2i 1, Z2i) f (Z2i 1, Z2i)) i=1 (f (Z2i 1, Z2i))2 | {z } a.s. E[(f (Z1,Z2))2] where the first convergence result is due to (36) and Lemma B.5 and the second convergence result is due to the SLLN. Using (36) and Lemma B.5, we deduce that: 2 i=1 (f (Z2i 1, Z2i))(fi(Z2i 1, Z2i) f (Z2i 1, Z2i)) i=1 |fi(Z2i 1, Z2i) f (Z2i 1, Z2i)| a.s. 0, and hence we conclude that the convergence (35) holds. B.2.2. MAIN RESULTS Theorem 2.1. Let C denote a class of functions c : X Y R for measuring dependence as per (5). 1. Under H0 in (1a) and (2a), any payoff f of the form (7) satisfies EH0 [f(Z1, Z2)] = 0. 2. Suppose that C satisfies (6). Under H1 in (1b), the oracle payoff f based on the witness function c satisfies EH1 [f (Z1, Z2)] > 0. Further, for λ defined in (8), it holds that EH1 [log(1 + λ f (Z1, Z2)] > 0. Hence, K t a.s. + , which implies that the oracle test is consistent: PH1(τ < ) = 1, where τ = inf {t 1 : K t 1/α}. Sequential Kernelized Independence Testing Proof. 1. Under H0 in (1a), we have that: (X2t 1, Y2t 1) d= (X2t, Y2t) d= (X2t 1, Y2t) d= (X2t, Y2t 1), and hence, the first part of the Proposition trivially follows from the linearity of expectation. Under distribution drift, we use that at least one of the marginal distributions does not change at each round. For example, suppose that at round t, it holds that: P 2t 1 X = P 2t X . For the stream of independent observations, we have: X2t Y2t 1 and X2t 1 Y2t. Further, under the H0 in (2a), it holds that: X2t 1 Y2t 1 and X2t Y2t. 
B.2.2. MAIN RESULTS

Theorem 2.1. Let $\mathcal{C}$ denote a class of functions $c : \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ for measuring dependence as per (5).
1. Under $H_0$ in (1a) and (2a), any payoff $f$ of the form (7) satisfies $\mathbb{E}_{H_0}[f(Z_1, Z_2)] = 0$.
2. Suppose that $\mathcal{C}$ satisfies (6). Under $H_1$ in (1b), the oracle payoff $f^\star$ based on the witness function $c^\star$ satisfies $\mathbb{E}_{H_1}[f^\star(Z_1, Z_2)] > 0$.
3. Further, for $\lambda^\star$ defined in (8), it holds that $\mathbb{E}_{H_1}[\log(1 + \lambda^\star f^\star(Z_1, Z_2))] > 0$. Hence, $\mathcal{K}^\star_t \overset{a.s.}{\to} +\infty$, which implies that the oracle test is consistent: $\mathbb{P}_{H_1}(\tau^\star < \infty) = 1$, where $\tau^\star = \inf\{t\ge1 : \mathcal{K}^\star_t \ge 1/\alpha\}$.

Proof. 1. Under $H_0$ in (1a), we have that
\[
(X_{2t-1}, Y_{2t-1}) \overset{d}{=} (X_{2t}, Y_{2t}) \overset{d}{=} (X_{2t-1}, Y_{2t}) \overset{d}{=} (X_{2t}, Y_{2t-1}),
\]
and hence the first part trivially follows from the linearity of expectation. Under distribution drift, we use that at least one of the marginal distributions does not change at each round. For example, suppose that at round $t$ it holds that $P^{2t-1}_X = P^{2t}_X$. For a stream of independent observations, we have $X_{2t}\perp Y_{2t-1}$ and $X_{2t-1}\perp Y_{2t}$. Further, under $H_0$ in (2a), it holds that $X_{2t-1}\perp Y_{2t-1}$ and $X_{2t}\perp Y_{2t}$. Hence, we have
\[
(X_{2t-1}, Y_{2t-1}) \overset{d}{=} (X_{2t}, Y_{2t-1}) \quad\text{and}\quad (X_{2t-1}, Y_{2t}) \overset{d}{=} (X_{2t}, Y_{2t}),
\]
and we again get the result using the linearity of expectation.

2. Under the i.i.d. setting, we have
\[
\mathbb{E}\left[f^\star(Z_{2t-1}, Z_{2t}) \mid \mathcal{F}_{t-1}\right] = \mathbb{E}\left[f^\star(Z_1, Z_2)\right] = s\cdot m(P_{XY};\mathcal{C}),
\]
and hence the result follows from the fact that the functional class $\mathcal{C}$ satisfies the characteristic condition (6).

3. Let $W := f^\star(Z_1, Z_2)$ and consider $\mathbb{E}_{H_1}[\log(1 + \lambda W)]$. We know that $\mathbb{E}_{H_1}[W] > 0$. We use the following inequality (Fan et al., 2015, Equation (4.12)): for any $y\ge -1$ and $\lambda\in[0,1)$, it holds that
\[
\log(1 + \lambda y) \ge \lambda y + y^2\left(\log(1-\lambda) + \lambda\right).
\]
Hence,
\[
\mathbb{E}\left[\log(1 + \lambda W)\right] \ge \lambda\,\mathbb{E}[W] + \mathbb{E}\left[W^2\right]\left(\log(1-\lambda) + \lambda\right).
\]
Finally, using that $\log(1-x) + x \ge -x^2/(2(1-x))$ for $x\in[0,1)$, and plugging in
\[
\lambda^\star = \frac{\mathbb{E}[W]}{\mathbb{E}[W] + \mathbb{E}[W^2]} \in (0, 1),
\]
we get
\[
\mathbb{E}_{H_1}\left[\log(1 + \lambda^\star W)\right] \ge \frac{\left(\mathbb{E}_{H_1}[W]\right)^2}{2\left(\mathbb{E}_{H_1}[W] + \mathbb{E}_{H_1}\left[W^2\right]\right)} > 0.
\]
The wealth process corresponding to the oracle test satisfies
\[
\mathcal{K}^\star_t = \prod_{i=1}^{t}\left(1 + \lambda^\star f^\star(Z_{2i-1}, Z_{2i})\right) = \exp\left(\sum_{i=1}^{t}\log\left(1 + \lambda^\star f^\star(Z_{2i-1}, Z_{2i})\right)\right).
\]
By the strong law of large numbers (SLLN), we have
\[
\frac{1}{t}\sum_{i=1}^{t}\log\left(1 + \lambda^\star f^\star(Z_{2i-1}, Z_{2i})\right) \overset{a.s.}{\longrightarrow} \mathbb{E}\left[\log(1 + \lambda^\star W)\right] > 0.
\]
Hence, $\mathcal{K}^\star_t \overset{a.s.}{\to} +\infty$, and the oracle test is consistent.

Theorem 2.4. Suppose that Assumption 2.3 is satisfied. The following claims hold for the HSIC-based SKIT (Algorithm 2):
1. Suppose that $H_0$ in (1a) or (2a) is true. Then SKIT ever stops with probability at most $\alpha$: $\mathbb{P}_{H_0}(\tau < \infty) \le \alpha$.
2. Suppose that $H_1$ in (1b) is true. Then it holds that $\mathcal{K}_t \overset{a.s.}{\to} +\infty$, and thus SKIT is consistent: $\mathbb{P}_{H_1}(\tau < \infty) = 1$. Further, the wealth grows exponentially, and the corresponding growth rate satisfies
\[
\liminf_{t\to\infty}\frac{\log\mathcal{K}_t}{t} \;\overset{a.s.}{\ge}\; \frac{\mathbb{E}\left[f^\star(Z_1, Z_2)\right]}{4}\left(\frac{\mathbb{E}\left[f^\star(Z_1, Z_2)\right]}{\mathbb{E}\left[\left(f^\star(Z_1, Z_2)\right)^2\right]} \wedge 1\right), \quad (15)
\]
where $f^\star$ is the oracle payoff defined in (12).

Remark B.7. While it will be clear from the proof that the i.i.d. assumption is sufficient but not necessary for asymptotic power one, the more relaxed sufficient conditions are slightly technical to state and are thus omitted.

Proof. 1. First, let us show that the predictable estimates of the oracle payoff function are bounded when the scaling factor $s = 1/2$ is used. Recall that
\[
f_t((x', y'), (x, y)) = \tfrac{1}{2}\left(\hat g_t(x', y') - \hat g_t(x', y) + \hat g_t(x, y) - \hat g_t(x, y')\right) = \tfrac{1}{2}\left\langle\hat g_t, \left(\varphi(x') - \varphi(x)\right)\otimes\left(\psi(y') - \psi(y)\right)\right\rangle_{\mathcal{G}\otimes\mathcal{H}}. \quad (37)
\]
Note that
\[
\left|f_t((x', y'), (x, y))\right| \le \tfrac{1}{2}\left\|\hat g_t\right\|_{\mathcal{G}\otimes\mathcal{H}}\left\|\varphi(x') - \varphi(x)\right\|_{\mathcal{G}}\left\|\psi(y') - \psi(y)\right\|_{\mathcal{H}} \le \tfrac{1}{2}\sqrt{2\left(1 - k(x', x)\right)}\sqrt{2\left(1 - l(y', y)\right)} \le 1,
\]
and hence $f_t((x', y'), (x, y))\in[-1, 1]$. Next, we show that the constructed payoff function yields a fair bet. Indeed, we have that
\[
\mathbb{E}\left[f_t(Z_{2t-1}, Z_{2t}) \mid \mathcal{F}_{t-1}\right] = \left\langle\hat g_t, \mu_{XY} - \mu_X\otimes\mu_Y\right\rangle_{\mathcal{G}\otimes\mathcal{H}},
\]
and in particular, the above implies that $\mathbb{E}_{H_0}[f_t(Z_{2t-1}, Z_{2t}) \mid \mathcal{F}_{t-1}] = 0$ for $H_0$ in (1a). For $H_0$ in (2a), it is easy to see that the result holds using the form (37): we use that $X_{2t-1}\perp Y_{2t-1}$, $X_{2t}\perp Y_{2t}$, $X_{2t}\perp Y_{2t-1}$, $X_{2t-1}\perp Y_{2t}$, and the fact that at least one of the marginal distributions does not change. Next, we show that, for all strategies for selecting betting fractions considered in this work, the resulting wealth process is a nonnegative martingale under the null. If the aGRAPA or ONS strategy is used, the resulting wealth process is clearly a nonnegative martingale since the betting fractions are predictable.
For the mixed wealth process $(\mathcal{K}^{\mathrm{mixed}}_t)_{t\ge1}$, we have
\[
\mathbb{E}_{H_0}\left[\mathcal{K}^{\mathrm{mixed}}_t \mid \mathcal{F}_{t-1}\right] = \mathbb{E}_{H_0}\left[\int_0^1 \mathcal{K}^{\lambda}_{t-1}\left(1 + \lambda f_t(Z_{2t-1}, Z_{2t})\right)\nu(\lambda)\,d\lambda \;\Big|\; \mathcal{F}_{t-1}\right] = \int_0^1 \mathcal{K}^{\lambda}_{t-1}\,\mathbb{E}_{H_0}\left[1 + \lambda f_t(Z_{2t-1}, Z_{2t}) \mid \mathcal{F}_{t-1}\right]\nu(\lambda)\,d\lambda = \int_0^1 \mathcal{K}^{\lambda}_{t-1}\,\nu(\lambda)\,d\lambda = \mathcal{K}^{\mathrm{mixed}}_{t-1},
\]
so it is also a nonnegative martingale under the null $H_0$; the interchange of conditional expectation and integration is justified by the conditional monotone convergence theorem. The assertion of the theorem then follows directly from Ville's inequality (Proposition B.1) with $a = 1/\alpha$.

2. Next, we establish the consistency of the HSIC-based SKIT with the ONS betting strategy. Under the ONS betting strategy, for any sequence of outcomes $(f_i)_{i\ge1}$ with $f_i\in[-1, 1]$, it holds that (see the proof of Theorem 1 in Cutkosky & Orabona (2018)):
\[
\log\mathcal{K}_t(\lambda_0) - \log\mathcal{K}_t = O(\log t), \quad (38)
\]
where $\mathcal{K}_t(\lambda_0)$ is the wealth of any constant betting strategy $\lambda_0\in[-1/2, 1/2]$ and $\mathcal{K}_t$ is the wealth corresponding to the ONS betting strategy. It follows that the wealth process corresponding to the ONS betting strategy satisfies
\[
\frac{\log\mathcal{K}_t}{t} \ge \frac{\log\mathcal{K}_t(\lambda_0)}{t} - \frac{C\log t}{t} \quad (39)
\]
for some absolute constant $C > 0$. Next, consider
\[
\lambda_0 = \frac{1}{2}\left(\frac{\sum_{i=1}^{t} f_i}{\sum_{i=1}^{t} f_i^2} \wedge 1\right). \quad (40)
\]
We obtain
\[
\log\mathcal{K}_t(\lambda_0) = \sum_{i=1}^{t}\log(1 + \lambda_0 f_i) \overset{(a)}{\ge} \sum_{i=1}^{t}\left(\lambda_0 f_i - \lambda_0^2 f_i^2\right) \ge \frac{t}{4}\cdot\frac{1}{t}\sum_{i=1}^{t} f_i\cdot\left(\frac{\frac{1}{t}\sum_{i=1}^{t} f_i}{\frac{1}{t}\sum_{i=1}^{t} f_i^2} \wedge 1\right),
\]
where in (a) we used that $\log(1+x) \ge x - x^2$ for $x\in[-1/2, 1/2]$ (a slightly better constant for the growth rate, $0.3$ in place of $1/4$, can be obtained by using the inequality $\log(1+x) \ge x - \tfrac{5}{6}x^2$, which holds for $x\in[-0.5, 0.5]$). From Lemma B.6, it follows for $f_i = f_i(Z_{2i-1}, Z_{2i})$ that
\[
\frac{1}{4}\cdot\frac{1}{t}\sum_{i=1}^{t} f_i(Z_{2i-1}, Z_{2i})\cdot\left(\frac{\frac{1}{t}\sum_{i=1}^{t} f_i(Z_{2i-1}, Z_{2i})}{\frac{1}{t}\sum_{i=1}^{t}\left(f_i(Z_{2i-1}, Z_{2i})\right)^2} \wedge 1\right) \overset{a.s.}{\longrightarrow} \frac{\mathbb{E}\left[f^\star(Z_1, Z_2)\right]}{4}\left(\frac{\mathbb{E}\left[f^\star(Z_1, Z_2)\right]}{\mathbb{E}\left[\left(f^\star(Z_1, Z_2)\right)^2\right]} \wedge 1\right). \quad (41)
\]
Further, note that
\[
\mathbb{E}\left[f^\star(Z_1, Z_2)\right] = \left\|\mu_{XY} - \mu_X\otimes\mu_Y\right\|_{\mathcal{G}\otimes\mathcal{H}} = \sqrt{\mathrm{HSIC}(P_{XY};\mathcal{G},\mathcal{H})},
\]
which is positive if $H_1$ is true. Hence, using (39), we deduce that the growth rate of the ONS wealth process satisfies
\[
\liminf_{t\to\infty}\frac{\log\mathcal{K}_t}{t} \;\overset{a.s.}{\ge}\; \frac{\mathbb{E}\left[f^\star(Z_1, Z_2)\right]}{4}\left(\frac{\mathbb{E}\left[f^\star(Z_1, Z_2)\right]}{\mathbb{E}\left[\left(f^\star(Z_1, Z_2)\right)^2\right]} \wedge 1\right). \quad (42)
\]
We conclude that the test is consistent, that is, if $H_1$ is true, then $\mathbb{P}(\tau < \infty) = 1$.

Proposition 2.5. The optimal log-wealth $S^\star := \mathbb{E}[\log(1 + \lambda^\star f^\star(Z_1, Z_2))]$ that can be achieved by an oracle betting scheme (16), which knows $f^\star$ from (12) and the underlying distribution, satisfies
\[
S^\star \le \frac{\mathbb{E}\left[f^\star(Z_1, Z_2)\right]}{2}\left(\frac{8\,\mathbb{E}\left[f^\star(Z_1, Z_2)\right]}{3\,\mathbb{E}\left[\left(f^\star(Z_1, Z_2)\right)^2\right]} \wedge 1\right). \quad (17)
\]

Proof. We start by establishing the upper bound in (17). The fact that $S^\star \le \mathbb{E}[f^\star(Z_1, Z_2)]/2$ trivially follows from
\[
\mathbb{E}\left[\log(1 + \lambda f^\star(Z_1, Z_2))\right] \le \lambda\,\mathbb{E}\left[f^\star(Z_1, Z_2)\right] \le \mathbb{E}\left[f^\star(Z_1, Z_2)\right]/2.
\]
Since for any $x\in[-0.5, 0.5]$ it holds that $\log(1+x) \le x - 3x^2/8$, we know that
\[
S^\star \le \max_{\lambda\in[-0.5, 0.5]}\left\{\lambda\,\mathbb{E}\left[f^\star(Z_1, Z_2)\right] - \tfrac{3}{8}\lambda^2\,\mathbb{E}\left[\left(f^\star(Z_1, Z_2)\right)^2\right]\right\}, \quad (43)
\]
and by solving the maximization problem, we get the upper bound
\[
S^\star \le \frac{2}{3}\cdot\frac{\left(\mathbb{E}\left[f^\star(Z_1, Z_2)\right]\right)^2}{\mathbb{E}\left[\left(f^\star(Z_1, Z_2)\right)^2\right]}, \quad (44)
\]
assuming that $\mathbb{E}[f^\star(Z_1, Z_2)]/\mathbb{E}[(f^\star(Z_1, Z_2))^2] \le 3/8$, so that the unconstrained maximizer of (43) lies in $[-0.5, 0.5]$. On the other hand, it always holds that $S^\star \le \mathbb{E}[f^\star(Z_1, Z_2)]/2$. To obtain the claimed bound, which is valid in both regimes, we multiply the RHS of (44) by two, which completes the proof of (17).
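The ONS strategy analyzed in the proof of Theorem 2.4 can be sketched as follows. This is a generic Online Newton Step variant for coin betting in the spirit of Cutkosky & Orabona (2018); the exact constants and truncation range used by Algorithm 1 may differ, and the function name is hypothetical.

```python
import numpy as np

def ons_wealth(payoffs, lam_max=0.5):
    """Wealth of an ONS-style betting strategy on payoffs in [-1, 1].
    The step-size constant and truncation to [0, lam_max] are assumptions."""
    lam, a, log_wealth = 0.0, 1.0, 0.0
    for f in payoffs:
        log_wealth += np.log1p(lam * f)            # bet with the current fraction
        g = -f / (1.0 + lam * f)                   # gradient of -log(1 + lam * f)
        a += g ** 2
        lam = lam - (2.0 / (2.0 - np.log(3.0))) * g / a
        lam = float(np.clip(lam, 0.0, lam_max))    # keep bets nonnegative
    return np.exp(log_wealth)
```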
Theorem 2.8. Suppose that $H_0$ in (18a) is true and that Assumption 2.7 holds. Then the HSIC-based SKIT (Algorithm 2) satisfies $\mathbb{P}_{H_0}(\tau < \infty) \le \alpha$.

Proof. Recall that at round $t$, the payoff function has the form
\[
f_t((X_{2t-1}, Y_{2t-1}), (X_{2t}, Y_{2t})) = \tfrac{1}{2}\left\langle\hat g_t, \left(\varphi(X_{2t}) - \varphi(X_{2t-1})\right)\otimes\left(\psi(Y_{2t}) - \psi(Y_{2t-1})\right)\right\rangle_{\mathcal{G}\otimes\mathcal{H}}.
\]
Let $D_t = \{(X_i, Y_i)\}_{i\le 2(t-1)}$. To establish validity, we need to show that, under $H_0$ in (18a),
\[
\mathbb{E}\left[f_t((X_{2t-1}, Y_{2t-1}), (X_{2t}, Y_{2t})) \mid D_t\right] = 0, \quad (45)
\]
and hence it suffices to show that
\[
\mathbb{E}\left[\left(\varphi(X_{2t}) - \varphi(X_{2t-1})\right)\otimes\left(\psi(Y_{2t}) - \psi(Y_{2t-1})\right) \mid D_t\right] = 0.
\]
Due to independence under the null $H_0$, we have
\[
\mathbb{E}\left[\varphi(X_{2t-1})\otimes\psi(Y_{2t-1}) \mid D_t\right] = \mathbb{E}\left[\varphi(X_{2t-1}) \mid D_t\right]\otimes\mathbb{E}\left[\psi(Y_{2t-1}) \mid D_t\right] =: \mu^{2t-1}_X\otimes\mu^{2t-1}_Y,
\]
\[
\mathbb{E}\left[\varphi(X_{2t})\otimes\psi(Y_{2t}) \mid D_t\right] = \mathbb{E}\left[\varphi(X_{2t}) \mid D_t\right]\otimes\mathbb{E}\left[\psi(Y_{2t}) \mid D_t\right] =: \mu^{2t}_X\otimes\mu^{2t}_Y.
\]
Consider one of the cross terms, $\varphi(X_{2t})\otimes\psi(Y_{2t-1})$. We have the following:
\[
\mathbb{E}\left[\varphi(X_{2t})\otimes\psi(Y_{2t-1}) \mid D_t\right] \overset{(a)}{=} \mathbb{E}\left[\mathbb{E}\left[\varphi(X_{2t})\otimes\psi(Y_{2t-1}) \mid X_{2t-1}, D_t\right] \mid D_t\right] \overset{(b)}{=} \mathbb{E}\left[\mathbb{E}\left[\varphi(X_{2t}) \mid X_{2t-1}, D_t\right]\otimes\mathbb{E}\left[\psi(Y_{2t-1}) \mid X_{2t-1}, D_t\right] \mid D_t\right]
\]
\[
\overset{(c)}{=} \mathbb{E}\left[\mathbb{E}\left[\varphi(X_{2t}) \mid X_{2t-1}, D_t\right]\otimes\mathbb{E}\left[\psi(Y_{2t-1}) \mid D_t\right] \mid D_t\right] \overset{(d)}{=} \mathbb{E}\left[\mathbb{E}\left[\varphi(X_{2t}) \mid X_{2t-1}, D_t\right] \mid D_t\right]\otimes\mathbb{E}\left[\psi(Y_{2t-1}) \mid D_t\right] \overset{(e)}{=} \mathbb{E}\left[\varphi(X_{2t}) \mid D_t\right]\otimes\mathbb{E}\left[\psi(Y_{2t-1}) \mid D_t\right] \overset{(f)}{=} \mu^{2t}_X\otimes\mu^{2t-1}_Y.
\]
In the above, (a) uses the law of iterated expectations and conditioning on $X_{2t-1}$, (b) uses the conditional-independence assumption (19), (c) uses the independence null assumption (18a), (d) uses that $\mathbb{E}[\psi(Y_{2t-1}) \mid D_t]$ is $\sigma(D_t)$-measurable, (e) uses the law of iterated expectations, and (f) uses the definitions of the mean embeddings of the conditional distributions. An analogous argument can be used to deduce $\mathbb{E}[\varphi(X_{2t-1})\otimes\psi(Y_{2t}) \mid D_t] = \mu^{2t-1}_X\otimes\mu^{2t}_Y$. We get that
\[
\mathbb{E}\left[\left(\varphi(X_{2t}) - \varphi(X_{2t-1})\right)\otimes\left(\psi(Y_{2t}) - \psi(Y_{2t-1})\right) \mid D_t\right] = \mu^{2t-1}_X\otimes\mu^{2t-1}_Y + \mu^{2t}_X\otimes\mu^{2t}_Y - \mu^{2t-1}_X\otimes\mu^{2t}_Y - \mu^{2t}_X\otimes\mu^{2t-1}_Y = \left(\mu^{2t}_X - \mu^{2t-1}_X\right)\otimes\left(\mu^{2t}_Y - \mu^{2t-1}_Y\right),
\]
and hence, if either $(X_{2t-1}, X_{2t})$ or $(Y_{2t-1}, Y_{2t})$ are exchangeable conditional on $D_t$, it follows that either $\mu^{2t}_X = \mu^{2t-1}_X$ or $\mu^{2t}_Y = \mu^{2t-1}_Y$, respectively. This, in turn, implies that (45) holds, and hence the result follows.

B.3. Proofs for Section 3

Theorem 3.1. Suppose that (A1) in Assumption 2.3 is satisfied. Then, under $H_0$ in (1a) and (18a), the COCO/KCC-based SKIT (Algorithm 3) satisfies $\mathbb{P}_{H_0}(\tau < \infty) \le \alpha$.

Proof. It suffices to show that the proposed payoff functions are bounded; the rest of the proof follows the same steps as the proof of Theorem 2.4 (for a stream of independent observations) or Theorem 2.8 (for the time-varying independence null), and we omit the details. Note that
\[
\left|\hat h_t(y') - \hat h_t(y)\right| = \left|\langle\hat h_t, \psi(y')\rangle_{\mathcal{H}} - \langle\hat h_t, \psi(y)\rangle_{\mathcal{H}}\right| = \left|\langle\hat h_t, \psi(y') - \psi(y)\rangle_{\mathcal{H}}\right| \le \left\|\hat h_t\right\|_{\mathcal{H}}\left\|\psi(y') - \psi(y)\right\|_{\mathcal{H}} \le \left\|\psi(y') - \psi(y)\right\|_{\mathcal{H}} = \sqrt{2\left(1 - l(y, y')\right)} \le \sqrt{2},
\]
where we used that $\|\hat h_t\|_{\mathcal{H}} \le 1$ due to normalization. An analogous bound holds for $|\hat g_t(x') - \hat g_t(x)|$. We conclude that any predictable estimate of the oracle payoff function for COCO (or KCC) satisfies $|f_t((x', y'), (x, y))| \le 1$, as proposed. The fact that the payoff function is fair trivially follows from the definition. Regarding the existence of the oracle payoff, whose mean is positive under $H_1$ in (1b), note that if $k$ and $l$ are characteristic kernels, then COCO and KCC satisfy the characteristic condition (6); see Jordan & Bach (2001) and Gretton et al. (2005c;b). Hence, the result follows from Theorem 2.1. This completes the proof.

B.4. Proofs for Section 4

Theorem 4.2. Under $H_0$ in (1a) and (18a), the symmetry-based SKIT (Algorithm 4) satisfies $\mathbb{P}_{H_0}(\tau < \infty) \le \alpha$.

Proof. For any $t\ge1$, the payoffs defined in (26), (27), and (28) are bounded: $f_t(w)\in[-1, 1]$ for all $w\in\mathbb{R}$. Due to Proposition 4.1, we know that, under the null, $W_t$ is a random variable that is symmetric around zero (conditional on $\mathcal{F}_{t-1}$). Hence, for the composition approach, it trivially follows that $\mathbb{E}_{H_0}[f^{\mathrm{odd}}_t(W_t) \mid \mathcal{F}_{t-1}] = 0$, since a composition with an odd function is used. For the rank and predictive approaches, we use the fact that, under the null, $\mathrm{sign}(W_t)\perp|W_t| \mid \mathcal{F}_{t-1}$. Since $\mathbb{E}_{H_0}[\mathrm{sign}(W_t) \mid \mathcal{F}_{t-1}] = 0$, it then follows that $\mathbb{E}_{H_0}[f^{\mathrm{rank}}_t(W_t) \mid \mathcal{F}_{t-1}] = 0$.
Using that $\mathrm{sign}(W_t)\perp|W_t| \mid \mathcal{F}_{t-1}$ and conditioning on the sign of $W_t$, we get
\[
\mathbb{E}_{H_0}\left[\ell_t(W_t) \mid \mathcal{F}_{t-1}\right] = \tfrac{1}{2}\mathbb{P}_{H_0}\left(p_t(|W_t|) \ge 1/2\right) + \tfrac{1}{2}\mathbb{P}_{H_0}\left(p_t(|W_t|) < 1/2\right) = \tfrac{1}{2}.
\]
Hence, $\mathbb{E}_{H_0}[1 - 2\ell_t(W_t) \mid \mathcal{F}_{t-1}] = 0$. The rest of the proof regarding the validity of the symmetry-based SKITs follows the same steps as the proof of Theorem 2.4, and we omit the details.

C. Selecting Betting Fractions

As alluded to in Remark 2.2, sticking to a single fixed betting fraction, $\lambda_t = \lambda\in[0,1]$ for all $t\ge1$, may result in a wealth process that either has a suboptimal growth rate under the alternative or tends to zero almost surely (see Figure 9). Mixing over different betting fractions is a simple approach that often works well in practice. Given a fine grid of values $\Lambda = \{\lambda^{(1)}, \ldots, \lambda^{(J)}\}$, e.g., uniformly spaced values on the unit interval, consider
\[
\mathcal{K}^{\mathrm{mixed}}_t = \frac{1}{J}\sum_{\lambda^{(j)}\in\Lambda}\mathcal{K}_t(\lambda^{(j)}), \quad (46)
\]
where $(\mathcal{K}_t(\lambda^{(j)}))_{t\ge0}$ is a wealth process corresponding to a constant-betting strategy with betting fraction $\lambda^{(j)}$. Practically, it is advisable to start with a coarse grid (small $J$) at small $t$ and occasionally add another grid point, so that the grid becomes finer over time; whenever a grid point is added, it is like adding another stock to a portfolio, and the wealth must be appropriately redistributed (we omit the details for brevity).

Figure 9. SKIT with the HSIC payoff function on two particular realizations of streams of dependent data: $Y_t = 0.1 X_t + \varepsilon_t$, $X_t, \varepsilon_t\sim N(0,1)$. For both cases, we consider the mixed wealth process for $\Lambda = \{0.05, 0.1, \ldots, 0.95\}$ (rejection threshold $1/\alpha = 20$). We observe that the mixed wealth process closely follows the better of the constant-betting strategies with $\lambda\in\{0.5, 0.95\}$.

While mixing often works well in practice, it introduces additional tuning hyperparameters, e.g., the grid size. We consider two compelling approaches for selecting betting fractions in a predictable way, meaning that $\lambda_t$ depends only on $\{(X_i, Y_i)\}_{i\le 2(t-1)}$. In addition to the ONS strategy (Algorithm 1), we also consider the aGRAPA strategy (Algorithm 5). The idea that effective betting strategies are ones that maximize a gambler's expected log-capital dates back to the early works of Kelly (1956) and Breiman (1962). Assuming that the same betting fraction is used at every round, the log-capital after round $(t-1)$ is
\[
\log\mathcal{K}_{t-1}(\lambda) = \sum_{i=1}^{t-1}\log\left(1 + \lambda f_i(Z_{2i-1}, Z_{2i})\right).
\]

Algorithm 5: aGRAPA strategy for selecting betting fractions
Input: sequence of payoffs $(f_t(Z_{2t-1}, Z_{2t}))_{t\ge1}$, $\lambda^{\mathrm{aGRAPA}}_1 = 0$, $\mu^{(1)}_0 = 0$, $\mu^{(2)}_0 = 1$, $c = 0.9$.
for $t = 1, 2, \ldots$ do
  Set $\mu^{(1)}_t = \mu^{(1)}_{t-1} + f_t(Z_{2t-1}, Z_{2t})$;
  Set $\mu^{(2)}_t = \mu^{(2)}_{t-1} + (f_t(Z_{2t-1}, Z_{2t}))^2$;
  Set $\lambda^{\mathrm{aGRAPA}}_{t+1} = c \wedge \left(0 \vee \mu^{(1)}_t/\mu^{(2)}_t\right)$;
end for

Following Waudby-Smith & Ramdas (2023), we set the derivative of the log-capital to zero and use a Taylor expansion to get
\[
\lambda^{\mathrm{aGRAPA}}_t = \frac{\sum_{i=1}^{t-1} f_i(Z_{2i-1}, Z_{2i})}{\sum_{i=1}^{t-1}\left(f_i(Z_{2i-1}, Z_{2i})\right)^2}.
\]
Truncation at zero is inspired by the fact that $\mathbb{E}_{H_1}[f^\star(Z_{2t-1}, Z_{2t}) \mid \mathcal{F}_{t-1}] > 0$, whereas truncation at $c\in(0, 1]$ (e.g., $c = 0.9$) is necessary to guarantee that the wealth process is indeed nonnegative.
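As a concrete illustration, a minimal sketch of a wealth process driven by aGRAPA betting fractions (Algorithm 5); payoff values are assumed to lie in $[-1, 1]$ and the function name is hypothetical.

```python
import numpy as np

def agrapa_wealth(payoffs, c=0.9):
    """Wealth process with aGRAPA betting fractions (Algorithm 5):
    lambda_{t+1} = clip( sum_i f_i / sum_i f_i^2 , 0, c )."""
    lam, m1, m2, wealth = 0.0, 0.0, 1.0, 1.0   # mu^(2)_0 = 1 as in Algorithm 5
    for f in payoffs:
        wealth *= 1.0 + lam * f                 # bet with the current (predictable) fraction
        m1 += f
        m2 += f ** 2
        lam = float(np.clip(m1 / m2, 0.0, c))
    return wealth
```

The predictability of $\lambda_t$ is respected: each payoff is used to update the betting fraction only after the wealth has been updated with the current one.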
D. Omitted Details for Sections 2 and 3

In this section, we complement the material presented in the main paper by deriving the forms of the witness functions for the dependence criteria considered in this work.

Oracle Witness Function for HSIC. Let us derive the form of the oracle witness function for HSIC. Note that
\[
\sup_{g:\,\|g\|_{\mathcal{G}\otimes\mathcal{H}}\le1}\left[\mathbb{E}_{P_{XY}}[g(X, Y)] - \mathbb{E}_{P_X\times P_Y}[g(X', Y')]\right] = \sup_{g:\,\|g\|\le1}\left[\mathbb{E}_{P_{XY}}\left[\langle g, \varphi(X)\otimes\psi(Y)\rangle_{\mathcal{G}\otimes\mathcal{H}}\right] - \mathbb{E}_{P_X\times P_Y}\left[\langle g, \varphi(X')\otimes\psi(Y')\rangle_{\mathcal{G}\otimes\mathcal{H}}\right]\right]
\]
\[
= \sup_{g:\,\|g\|\le1}\left[\langle g, \mu_{XY}\rangle_{\mathcal{G}\otimes\mathcal{H}} - \langle g, \mu_X\otimes\mu_Y\rangle_{\mathcal{G}\otimes\mathcal{H}}\right] = \sup_{g:\,\|g\|\le1}\langle g, \mu_{XY} - \mu_X\otimes\mu_Y\rangle_{\mathcal{G}\otimes\mathcal{H}},
\]
from which it is easy to derive the oracle witness function for HSIC (the supremum is attained at $g^\star = (\mu_{XY} - \mu_X\otimes\mu_Y)/\|\mu_{XY} - \mu_X\otimes\mu_Y\|_{\mathcal{G}\otimes\mathcal{H}}$).

Remark D.1. Note that in (13) the witness function is defined as an operator $\hat g_t : \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$. To clarify, for any $z = (x, y)\in\mathcal{X}\times\mathcal{Y}$, we have
\[
\left(\hat\mu_{XY} - \hat\mu_X\otimes\hat\mu_Y\right)(z) = \frac{1}{2(t-1)}\sum_{i=1}^{2(t-1)} k(X_i, x)\,l(Y_i, y) - \left(\frac{1}{2(t-1)}\sum_{i=1}^{2(t-1)} k(X_i, x)\right)\left(\frac{1}{2(t-1)}\sum_{i=1}^{2(t-1)} l(Y_i, y)\right),
\]
and the denominator in (13) can be expressed in terms of the kernel matrices $K, L\in\mathbb{R}^{2(t-1)\times 2(t-1)}$ with entries $K_{ij} = k(X_i, X_j)$, $L_{ij} = l(Y_i, Y_j)$, $i, j\in\{1, \ldots, 2(t-1)\}$, as
\[
\left\|\hat\mu_{XY} - \hat\mu_X\otimes\hat\mu_Y\right\|_{\mathcal{G}\otimes\mathcal{H}} = \frac{1}{2(t-1)}\sqrt{\mathrm{tr}\left(KHLH\right)},
\]
where $H = I_{2(t-1)} - \frac{1}{2(t-1)}\mathbf{1}\mathbf{1}^\top$ is the centering projection matrix.

Remark D.2. While the empirical witness functions for COCO/KCC (21) are defined as operators, we use them as functions in the definition of the corresponding payoff function. To clarify, for any $x\in\mathcal{X}$ and $y\in\mathcal{Y}$, we have
\[
\hat g_t(x) = \sum_{i=1}^{2(t-1)}\alpha_i\left(k(X_i, x) - \frac{1}{2(t-1)}\sum_{j=1}^{2(t-1)} k(X_j, x)\right), \qquad \hat h_t(y) = \sum_{i=1}^{2(t-1)}\beta_i\left(l(Y_i, y) - \frac{1}{2(t-1)}\sum_{j=1}^{2(t-1)} l(Y_j, y)\right).
\]

Minibatched Payoff Function for HSIC. The minibatched payoff function at round $t$ has the following form:
\[
f_t(Z_{b(t-1)+1}, \ldots, Z_{bt}) = \frac{1}{b}\sum_{i=1}^{b}\hat g_t(X_{b(t-1)+i}, Y_{b(t-1)+i}) - \frac{1}{b(b-1)}\sum_{\substack{i,j=1:\,i\ne j}}^{b}\hat g_t(X_{b(t-1)+i}, Y_{b(t-1)+j}).
\]
Equivalently,
\[
f_t(Z_{b(t-1)+1}, \ldots, Z_{bt}) = \frac{1}{b}\sum_{i=1}^{b}\left\langle\hat g_t, \varphi(X_{b(t-1)+i})\otimes\psi(Y_{b(t-1)+i})\right\rangle_{\mathcal{G}\otimes\mathcal{H}} - \frac{1}{b(b-1)}\sum_{i\ne j}\left\langle\hat g_t, \varphi(X_{b(t-1)+i})\otimes\psi(Y_{b(t-1)+j})\right\rangle_{\mathcal{G}\otimes\mathcal{H}}
\]
\[
= \left\langle\hat g_t, \frac{1}{2b(b-1)}\sum_{i\ne j}\left(\varphi(X_{b(t-1)+i}) - \varphi(X_{b(t-1)+j})\right)\otimes\left(\psi(Y_{b(t-1)+i}) - \psi(Y_{b(t-1)+j})\right)\right\rangle_{\mathcal{G}\otimes\mathcal{H}}.
\]
Let $\mathcal{F}'_{t-1} = \sigma\left(\{(X_i, Y_i)\}_{i\le b(t-1)}\right)$. We have that
\[
\mathbb{E}\left[f_t(Z_{b(t-1)+1}, \ldots, Z_{bt}) \mid \mathcal{F}'_{t-1}\right] = \left\langle\hat g_t, \mu_{XY} - \mu_X\otimes\mu_Y\right\rangle_{\mathcal{G}\otimes\mathcal{H}},
\]
and in particular, $\mathbb{E}_{H_0}[f_t(Z_{b(t-1)+1}, \ldots, Z_{bt}) \mid \mathcal{F}'_{t-1}] = 0$ if the null $H_0$ in (1a) is true. It thus suffices to show that the payoff is bounded. Since $\|\hat g_t\|_{\mathcal{G}\otimes\mathcal{H}} = 1$, we can easily deduce that
\[
\left|f_t(Z_{b(t-1)+1}, \ldots, Z_{bt})\right| \le \frac{1}{2b(b-1)}\sum_{i\ne j}\left\|\varphi(X_{b(t-1)+i}) - \varphi(X_{b(t-1)+j})\right\|_{\mathcal{G}}\left\|\psi(Y_{b(t-1)+i}) - \psi(Y_{b(t-1)+j})\right\|_{\mathcal{H}} = \frac{1}{2b(b-1)}\sum_{i\ne j}\sqrt{2\left(1 - k(X_{b(t-1)+i}, X_{b(t-1)+j})\right)}\sqrt{2\left(1 - l(Y_{b(t-1)+i}, Y_{b(t-1)+j})\right)} \le 1.
\]
Hence, we conclude that the wealth process constructed using the minibatched version of the payoff function is also a nonnegative martingale.

Example 3. For $t\ge1$, consider
\[
(X_t, Y_t) = \left(\frac{V_t + 1 - 1/t}{2}, \; \frac{V'_t + 1 - 1/t}{2}\right), \quad \text{where } V_t, V'_t \overset{iid}{\sim} \mathrm{Ber}(1/2).
\]
Note that $\mathcal{X} = \mathcal{Y} = [0, 1]$, which means that the pair of linear kernels $k(x, x') = xx'$ and $l(y, y') = yy'$ are nonnegative and bounded by one on $\mathcal{X}$ and $\mathcal{Y}$ respectively. Note that for a linear kernel, $\hat g_t(x, y) = \hat g_t\,xy$ for a scalar $\hat g_t$. Then
\[
f_t((X_{2t-1}, Y_{2t-1}), (X_{2t}, Y_{2t})) = \frac{\hat g_t}{2}\left(X_{2t} - X_{2t-1}\right)\left(Y_{2t} - Y_{2t-1}\right) = \frac{\hat g_t}{8}\left(V_{2t} - V_{2t-1} + \frac{1}{2t(2t-1)}\right)\left(V'_{2t} - V'_{2t-1} + \frac{1}{2t(2t-1)}\right).
\]
In particular, $\mathbb{E}\left[f_t((X_{2t-1}, Y_{2t-1}), (X_{2t}, Y_{2t})) \mid \mathcal{F}_{t-1}\right] = \frac{\hat g_t}{8}\left(\frac{1}{2t(2t-1)}\right)^2 \ne 0$ whenever $\hat g_t \ne 0$, implying that the wealth process $(\mathcal{K}_t)_{t\ge0}$ is no longer a nonnegative martingale.
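To make Remark D.1 concrete, a minimal sketch of evaluating the plug-in HSIC witness on one-dimensional data; the RBF kernel, unit bandwidth, and all names are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, bw=1.0):
    return np.exp(-(a - b) ** 2 / (2 * bw ** 2))

def hsic_witness(x_hist, y_hist, x, y, bw=1.0):
    """Plug-in HSIC witness of Remark D.1 at (x, y):
    (mu_hat_XY - mu_hat_X (x) mu_hat_Y)(x, y) / ((1/n) sqrt(tr(K H L H)))."""
    n = len(x_hist)
    kx, ly = rbf(x_hist, x, bw), rbf(y_hist, y, bw)
    numer = np.mean(kx * ly) - np.mean(kx) * np.mean(ly)
    K = rbf(x_hist[:, None], x_hist[None, :], bw)
    L = rbf(y_hist[:, None], y_hist[None, :], bw)
    H = np.eye(n) - np.ones((n, n)) / n
    denom = np.sqrt(max(np.trace(K @ H @ L @ H), 1e-12)) / n
    return numer / denom
```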
Witness Functions for COCO. Let $\Phi$ and $\Psi$ be the matrices whose columns are the embeddings of $X_1, \ldots, X_{2(t-1)}$ and $Y_1, \ldots, Y_{2(t-1)}$, that is, $\varphi(X_i) = k(X_i, \cdot)$ and $\psi(Y_i) = l(Y_i, \cdot)$ for $i = 1, \ldots, 2(t-1)$. Recall that
\[
g = \sum_{i=1}^{2(t-1)}\alpha_i\left(\varphi(X_i) - \frac{1}{2(t-1)}\sum_{j=1}^{2(t-1)}\varphi(X_j)\right) = \Phi H\alpha, \qquad h = \sum_{i=1}^{2(t-1)}\beta_i\left(\psi(Y_i) - \frac{1}{2(t-1)}\sum_{j=1}^{2(t-1)}\psi(Y_j)\right) = \Psi H\beta,
\]
where $H = I_{2(t-1)} - \frac{1}{2(t-1)}\mathbf{1}\mathbf{1}^\top$ is the centering projection matrix. We have
\[
\left\langle g, \hat C_{XY} h\right\rangle_{\mathcal{G}} = \frac{1}{2(t-1)}(\alpha^\top H\Phi^\top)(\Phi H\Psi^\top)(\Psi H\beta) = \frac{1}{2(t-1)}\alpha^\top HKHLH\beta = \frac{1}{2(t-1)}\alpha^\top\tilde K\tilde L\beta, \qquad \|g\|^2_{\mathcal{G}} = \alpha^\top\tilde K\alpha, \qquad \|h\|^2_{\mathcal{H}} = \beta^\top\tilde L\beta,
\]
where $\tilde K := HKH$ and $\tilde L := HLH$ are the centered kernel matrices. Hence, the maximization problem in (20) can be expressed as
\[
\max_{\alpha, \beta}\;\frac{1}{2(t-1)}\alpha^\top\tilde K\tilde L\beta \quad\text{subject to}\quad \alpha^\top\tilde K\alpha = 1, \;\; \beta^\top\tilde L\beta = 1. \quad (47)
\]
After introducing Lagrange multipliers, it can then be shown that the $\alpha$ and $\beta$ that solve (47) exactly correspond to the generalized eigenvalue problem (22).

Witness Functions for KCC. Introduce the empirical covariance operators:
\[
\hat C_X = \frac{1}{2(t-1)}\sum_{i=1}^{2(t-1)}\left(\varphi(X_i) - \hat\mu_X\right)\otimes\left(\varphi(X_i) - \hat\mu_X\right) = \frac{1}{2(t-1)}\Phi H\Phi^\top, \qquad \hat C_Y = \frac{1}{2(t-1)}\sum_{i=1}^{2(t-1)}\left(\psi(Y_i) - \hat\mu_Y\right)\otimes\left(\psi(Y_i) - \hat\mu_Y\right) = \frac{1}{2(t-1)}\Psi H\Psi^\top.
\]
Then the empirical variance terms can be expressed as
\[
\widehat{\mathrm{Var}}[g(X)] = \left\langle g, \hat C_X g\right\rangle_{\mathcal{G}} = \frac{1}{2(t-1)}(\alpha^\top H\Phi^\top)(\Phi H\Phi^\top)(\Phi H\alpha) = \frac{1}{2(t-1)}\alpha^\top\tilde K^2\alpha, \qquad \widehat{\mathrm{Var}}[h(Y)] = \left\langle h, \hat C_Y h\right\rangle_{\mathcal{H}} = \frac{1}{2(t-1)}\beta^\top\tilde L^2\beta.
\]
Thus, an empirical estimator of the kernelized canonical correlation (23) can be obtained by solving
\[
\max_{\alpha, \beta}\;\frac{1}{2(t-1)}\alpha^\top\tilde K\tilde L\beta \quad\text{subject to}\quad \frac{1}{2(t-1)}\alpha^\top\tilde K^2\alpha + \kappa_1\alpha^\top\tilde K\alpha = 1, \;\; \frac{1}{2(t-1)}\beta^\top\tilde L^2\beta + \kappa_2\beta^\top\tilde L\beta = 1.
\]
After introducing Lagrange multipliers, it can then be shown that the $\alpha$ and $\beta$ that solve (23) correspond to the generalized eigenvalue problem
\[
\begin{pmatrix} 0 & \frac{1}{2(t-1)}\tilde K\tilde L \\ \frac{1}{2(t-1)}\tilde L\tilde K & 0 \end{pmatrix}\begin{pmatrix}\alpha\\ \beta\end{pmatrix} = \gamma\begin{pmatrix} \kappa_1\tilde K + \frac{1}{2(t-1)}\tilde K^2 & 0 \\ 0 & \kappa_2\tilde L + \frac{1}{2(t-1)}\tilde L^2 \end{pmatrix}\begin{pmatrix}\alpha\\ \beta\end{pmatrix}.
\]

E. Additional Simulations

This section contains: (a) additional experiments on synthetic data and (b) visualizations of the datasets used in this paper.

E.1. Test of Instantaneous Dependence

In Figure 10, we demonstrate that it is hard to visually tell the difference between independence and dependence under the distribution drift setting (2); see Example 1 for details.

Figure 10. Sample of independent (subplot (a)) and dependent ($\rho = 0.5$, subplot (b)) data generated according to (3). The purpose of visualizing the raw data is to demonstrate that dependence is hard to detect visually, and that dependence refers to more than the temporal correlation which may be present due to cyclical trends.

E.2. Distribution Drift

In this section, we consider the linear Gaussian model with an underlying distribution drift:
\[
Y_t = X_t\beta_t + \varepsilon_t, \quad X_t, \varepsilon_t\sim N(0, 1), \quad t\ge1,
\]
that is, in contrast to the Gaussian linear model (Section 3), $\beta_t$ changes over time. We gradually increase it from $\beta_t = 0$ to $\beta_t = 0.1$ in increments of $0.02$, that is,
\[
\underbrace{\beta_0, \ldots, \beta_{b-1}}_{=0},\;\underbrace{\beta_b, \ldots, \beta_{2b-1}}_{=0.02},\;\ldots,\;\underbrace{\beta_{5b}, \beta_{5b+1}, \ldots}_{=0.1},
\]
and, starting with $\beta_{5b}$, we keep it equal to $0.1$. We consider $b\in\{100, 200, 400\}$ as possible block sizes. Note that there is a transition from independence (the first $b$ datapoints in a stream) to dependence. In Figure 11, we show that our test performs well under the distribution drift setting and consistently detects dependence.

Figure 11. Rejection rate of the sequential independence test under the distribution drift setting for block sizes $b\in\{100, 200, 400\}$ ($\alpha = 0.05$). Focusing on the non-i.i.d. time-varying setting, we confirm that our test has high power under the alternative.
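A minimal sketch of generating this drifting stream (the function name and RNG seed are arbitrary):

```python
import numpy as np

def drifting_stream(n, b=100, rng=None):
    """Linear-Gaussian stream of Section E.2: beta_t is 0 on the first block of
    size b, then increases by 0.02 per block until it reaches 0.1 and stays there."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.normal(size=n)
    beta = np.minimum(np.arange(n) // b, 5) * 0.02   # 0, 0.02, ..., 0.1
    y = x * beta + rng.normal(size=n)
    return x, y
```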
E.3. Symmetry-based Payoff Functions

In this section, we complement the comparison presented in Section 4 between the rank- and composition-based betting strategies (since those require minimal tuning), used with the ONS or aGRAPA criteria for selecting betting fractions. We also increase the monitoring horizon to 20000 datapoints. In Figure 12(a), we consider the Gaussian linear model but, in contrast to the setting considered in Section 4, we focus on harder testing settings by considering $\beta\in[0, 0.3]$. In Figure 12(b), we compare the composition- and rank-based approaches when data are sampled from the spherical model. In both cases, the composition- and rank-based approaches perform similarly; neither payoff uniformly dominates the other. We also observe that selecting betting fractions via the aGRAPA criterion tends to result in a slightly more powerful testing procedure.

Figure 12. (a) Comparison of the symmetry-based betting strategies ($f^{\mathrm{rank}}_t$ and $f^{\mathrm{odd}}_t$, each with ONS or aGRAPA): rejection rate and sample budget used to reject the null as functions of $\beta$ under the Gaussian model. The betting strategy based on composition with an odd function performs only slightly better than the rank-based strategy. (b) SKIT with composition- and rank-based betting strategies under the spherical model as a function of the dimension $d$. Neither betting strategy uniformly dominates the other. The aGRAPA criterion for selecting betting fractions tends to result in a slightly more powerful testing procedure.

E.4. Hard-to-detect Dependence

Consider the joint density $p(x, y)$ of the form
\[
p(x, y) = \frac{1}{4\pi^2}\left(1 + \sin(wx)\sin(wy)\right)\mathbf{1}\left\{(x, y)\in[-\pi, \pi]^2\right\}. \quad (48)
\]
The null case corresponds to $w = 0$, and the testing problem becomes harder as $w$ grows. In Figure 13, we illustrate the densities and a data sample for the hard-to-detect setting (48). We use $\lambda_X = \lambda_Y = 3/(4\pi^2)$ as the RBF kernel hyperparameters. For visualization purposes, we stop monitoring after observing 20000 datapoints from $P_{XY}$, and if a SKIT does not reject $H_0$ by that time, we assume that the null is retained. The results are aggregated over 200 runs for each value of $w$. In Figure 14, where the null case corresponds to $w = 0$, we confirm that SKITs have time-uniform type I error control. The average rejection rate starts to drop for $w\ge3$, meaning that observing 20000 points from $P_{XY}$ does not suffice to detect dependence.

Figure 13. Visualization of the densities (top) and a dataset of size 5000 (bottom) sampled from the corresponding distribution.

Figure 14. Rejection rate (solid) and fraction of the sample budget used before the null hypothesis was rejected (dashed) for the hard-to-detect dependence model, for HSIC-, COCO-, and KCC-based SKITs ($\alpha = 0.05$). By inspecting the rejection rate for $w = 0$ (independence holds), we confirm that the type I error is controlled. Further, SKIT adapts to the complexity of the problem (larger $w$ corresponds to a harder setting).
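Sampling from the density (48) is straightforward via rejection sampling, since $p(x, y)$ is bounded by twice the uniform density on the square. A minimal sketch (names are illustrative):

```python
import numpy as np

def sample_hard_to_detect(n, w, rng=None):
    """Rejection sampler for p(x, y) = (1 + sin(w x) sin(w y)) / (4 pi^2) on [-pi, pi]^2.
    Proposal: uniform on the square; acceptance probability (1 + sin(wx) sin(wy)) / 2."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = []
    while len(out) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        if rng.uniform() < 0.5 * (1.0 + np.sin(w * x) * np.sin(w * y)):
            out.append((x, y))
    return np.array(out)
```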
E.5. Additional Results for Real Data

In Figure 15, we illustrate that the average daily temperatures in the selected cities share similar seasonal patterns. We repeat the same experiment as in Section 4, but for four cities in South Africa: Cape Town (CT), Port Elizabeth (PE), Durban (DRN), and Bloemfontein (BFN). In Figures 15(d) and 15(e), we illustrate the resulting wealth processes for each pair of cities and for each region. Finally, we illustrate the pairs of cities for which the null has been rejected in Figure 15(c).

Figure 15. Temperatures for selected cities in Europe (subplot (a)) and South Africa (subplot (b)) share similar seasonal patterns. Map (subplot (c)) where solid red lines connect the cities for which the null is rejected; SKIT supports our conjecture about dependent temperature fluctuations for geographically close cities. For completeness, we also plot the wealth processes of SKIT applied to the weather data for Europe (subplot (d)) and South Africa (subplot (e)), with rejection threshold $6/\alpha = 120$.

E.6. Experiment with MNIST Data

In this section, we analyze the performance of SKIT on high-dimensional real data. This experiment is based on the MNIST dataset (LeCun et al., 1998), where pairs of digits are observed at each step; under the null, one sees digits $(a, b)$ where $a$ and $b$ are chosen uniformly at random, whereas under the alternative one sees $(a, a')$, i.e., two different images of the same digit. To estimate kernel hyperparameters, we deploy the median heuristic using 20 pairs of images. We illustrate the results in Figure 16. Under the null, our test does not reject more often than the prescribed 5%, and its power increases with sample size under the alternative, reaching power one after processing 500 pairs of digits (points from $P_{XY}$) on average.

Figure 16. Rejection rate of SKIT on MNIST data. Under the null (red dashed line), the test does not reject more often than the prescribed 5%; under the alternative (blue solid line), power increases with sample size. Each pair corresponds to two points from $P_{XY}$, and SKIT reaches power one after processing 500 pairs of images on average.

F. Scaling Sequential Testing Procedures

Updating the wealth process at each round requires evaluating the payoff function at a new pair of observations (and hence computing the witness function corresponding to the chosen dependence criterion). In this section, we provide details on ways of reducing the computational complexity of this step, which are necessary to scale the proposed sequential testing framework to moderately large sample sizes. Note that the proposed implementation of COCO allows updating kernel hyperparameters on the fly. In contrast, linear-time updates for HSIC require fixing the kernel hyperparameters in advance.

F.1. Incomplete/Pivoted Cholesky Decomposition for COCO and KCC

Suppose that we want to evaluate the COCO payoff function on the next pair of points $(X_{2t-1}, Y_{2t-1}), (X_{2t}, Y_{2t})$. In order to do so, we need to compute $\hat g_t$ and $\hat h_t$, that is, solve the generalized eigenvalue problem. Solving a generalized eigenvalue problem at each iteration can be computationally prohibitive. One simple option is to use a random subsample of datapoints when estimating the witness functions: once the sample size $n$ exceeds some $n_s$ (e.g., $n_s = 25$), we randomly subsample (without replacement) a set of size $n_s$ to estimate the witness functions. Alternatively, a common approach is to reduce the computational burden through an incomplete Cholesky decomposition. The idea is to exploit the fact that kernel matrices tend to exhibit rapid spectral decay, so that low-rank approximations can be used to scale the procedures. Suppose that $K\approx G_1G_1^\top$ and $L\approx G_2G_2^\top$, where the $G_i$ are lower-triangular matrices of size $n\times M_i$ ($M_i$ depends on the preset approximation error level). After computing the Cholesky factors, we center both matrices via left multiplication by $H$ and compute the SVDs of $HG_1$ and $HG_2$, that is, $HG_1 = U_1\Lambda_1 V_1^\top$ and $HG_2 = U_2\Lambda_2 V_2^\top$. We then have
\[
\tilde K \approx U_1\Lambda_1^2 U_1^\top, \qquad \tilde L \approx U_2\Lambda_2^2 U_2^\top.
\]
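A minimal sketch of a greedy pivoted (incomplete) Cholesky factorization with a trace-based stopping rule, followed by the centering and SVD step described above; the function name and tolerance are illustrative assumptions.

```python
import numpy as np

def pivoted_cholesky(gram, tol=1e-6, max_rank=None):
    """Greedy pivoted (incomplete) Cholesky of a PSD Gram matrix.
    Returns G with gram ~= G @ G.T; stops once the trace of the residual
    (its nuclear norm, since the residual stays PSD) drops below tol."""
    n = gram.shape[0]
    max_rank = n if max_rank is None else max_rank
    d = np.diag(gram).astype(float)            # residual diagonal
    G = np.zeros((n, max_rank))
    for m in range(max_rank):
        if d.sum() <= tol:                     # residual trace small enough
            return G[:, :m]
        i = int(np.argmax(d))                  # pivot: largest residual diagonal
        G[:, m] = (gram[:, i] - G[:, :m] @ G[i, :m]) / np.sqrt(d[i])
        d -= G[:, m] ** 2
    return G

# usage sketch: G1 = pivoted_cholesky(K); H = np.eye(n) - np.ones((n, n)) / n
# U1, s1, _ = np.linalg.svd(H @ G1, full_matrices=False)  # HKH ~= U1 diag(s1**2) U1.T
```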
Our goal is to find the largest eigenvalue/eigenvector pair for $Ax = \gamma Bx$ for a positive definite matrix $B$. Since
\[
Ax = \gamma Bx \iff B^{-1/2}AB^{-1/2}\left(B^{1/2}x\right) = \gamma\left(B^{1/2}x\right),
\]
it suffices to find the leading eigenvalue/eigenvector pair for $B^{-1/2}AB^{-1/2}y = \gamma y$; then $x = B^{-1/2}y$ is a generalized eigenvector for the initial problem. Below, $n = 2(t-1)$ and $U = \mathrm{diag}(U_1, U_2)$.

COCO. For COCO, we have
\[
B = \begin{pmatrix}\tilde K & 0 \\ 0 & \tilde L\end{pmatrix} \approx \begin{pmatrix}U_1\Lambda_1^2 U_1^\top & 0 \\ 0 & U_2\Lambda_2^2 U_2^\top\end{pmatrix} = U\begin{pmatrix}\Lambda_1^2 & 0 \\ 0 & \Lambda_2^2\end{pmatrix}U^\top \quad\Longrightarrow\quad B^{-1/2} \approx U\begin{pmatrix}\Lambda_1^{-1} & 0 \\ 0 & \Lambda_2^{-1}\end{pmatrix}U^\top.
\]
We also have
\[
A = \begin{pmatrix}0 & \frac{1}{n}\tilde K\tilde L \\ \frac{1}{n}\tilde L\tilde K & 0\end{pmatrix} \approx U\begin{pmatrix}0 & \frac{1}{n}\Lambda_1^2 U_1^\top U_2\Lambda_2^2 \\ \frac{1}{n}\Lambda_2^2 U_2^\top U_1\Lambda_1^2 & 0\end{pmatrix}U^\top.
\]
Thus,
\[
B^{-1/2}AB^{-1/2} \approx U\begin{pmatrix}0 & \frac{1}{n}\Lambda_1 U_1^\top U_2\Lambda_2 \\ \frac{1}{n}\Lambda_2 U_2^\top U_1\Lambda_1 & 0\end{pmatrix}U^\top.
\]
Hence, we only need to compute the leading eigenvector (say, $z^\star$) of
\[
\begin{pmatrix}0 & \frac{1}{n}\Lambda_1 U_1^\top U_2\Lambda_2 \\ \frac{1}{n}\Lambda_2 U_2^\top U_1\Lambda_1 & 0\end{pmatrix} \in \mathbb{R}^{(M_1+M_2)\times(M_1+M_2)}.
\]
It follows that the leading eigenvector of $B^{-1/2}AB^{-1/2}$ is $Uz^\star$, and the solution of the generalized eigenvalue problem is given by
\[
B^{-1/2}Uz^\star = U\Lambda^{-1}z^\star = \begin{pmatrix}U_1\Lambda_1^{-1}z^\star_1 \\ U_2\Lambda_2^{-1}z^\star_2\end{pmatrix}.
\]
Next, we need to normalize this vector of coefficients appropriately, i.e., we need to guarantee that $\|\tilde K^{1/2}\alpha\|_2 = 1$ and $\|\tilde L^{1/2}\beta\|_2 = 1$; re-normalizing naively is quadratic in $n$. Instead, note that in order to compute the incomplete Cholesky decomposition, we choose a tolerance parameter $\delta$ so that $\|K - G_1G_1^\top\|_* \le \delta$ (nuclear norm), absorbing the pivoting permutation into $G_1$. Let $\Delta := K - G_1G_1^\top$. We know that
\[
\alpha^\top\tilde K\alpha = \alpha^\top HKH\alpha = \alpha^\top H(\Delta + G_1G_1^\top)H\alpha = \alpha^\top H\Delta H\alpha + \left\|G_1^\top H\alpha\right\|_2^2.
\]
First, note that $|\alpha^\top H\Delta H\alpha| \le \delta\|H\alpha\|_2^2$. Next, $G_1^\top H = V_1\Lambda_1 U_1^\top$. Given an initial vector of coefficients $\alpha_0$ and $\beta_0$, the vectors can therefore be normalized in linear time using
\[
\alpha = \frac{\alpha_0}{\sqrt{\|G_1^\top H\alpha_0\|_2^2 + \delta\|H\alpha_0\|_2^2}} = \frac{U_1\Lambda_1^{-1}z^\star_1}{\sqrt{\|V_1 z^\star_1\|_2^2 + \delta\|H\alpha_0\|_2^2}} = \frac{U_1\Lambda_1^{-1}z^\star_1}{\sqrt{\|z^\star_1\|_2^2 + \delta\|H\alpha_0\|_2^2}}, \qquad \beta = \frac{\beta_0}{\sqrt{\|G_2^\top H\beta_0\|_2^2 + \delta\|H\beta_0\|_2^2}} = \frac{U_2\Lambda_2^{-1}z^\star_2}{\sqrt{\|z^\star_2\|_2^2 + \delta\|H\beta_0\|_2^2}}.
\]
For small $\delta$, we essentially normalize by $\sqrt{\alpha_0^\top\tilde K\alpha_0}$ and $\sqrt{\beta_0^\top\tilde L\beta_0}$, as expected. It also makes sense to use $\delta = n\delta_0$. Still, re-estimating the witness functions after processing $2t$, $t\ge1$, points is computationally intensive. In contrast to HSIC, for which there are no clear benefits of skipping certain estimation steps, for COCO we estimate the witness functions after processing $2t^2$, $t\ge1$, points.

KCC. For KCC, we have
\[
B = \begin{pmatrix}\kappa_1\tilde K + \frac{1}{n}\tilde K^2 & 0 \\ 0 & \kappa_2\tilde L + \frac{1}{n}\tilde L^2\end{pmatrix} \approx U\begin{pmatrix}\Lambda_1^2\left(\kappa_1 I + \frac{1}{n}\Lambda_1^2\right) & 0 \\ 0 & \Lambda_2^2\left(\kappa_2 I + \frac{1}{n}\Lambda_2^2\right)\end{pmatrix}U^\top,
\]
which implies that
\[
B^{-1/2} \approx U\begin{pmatrix}\left(\kappa_1 I + \frac{1}{n}\Lambda_1^2\right)^{-1/2}\Lambda_1^{-1} & 0 \\ 0 & \left(\kappa_2 I + \frac{1}{n}\Lambda_2^2\right)^{-1/2}\Lambda_2^{-1}\end{pmatrix}U^\top.
\]
Recall that
\[
A \approx U\begin{pmatrix}0 & \frac{1}{n}\Lambda_1^2 U_1^\top U_2\Lambda_2^2 \\ \frac{1}{n}\Lambda_2^2 U_2^\top U_1\Lambda_1^2 & 0\end{pmatrix}U^\top, \qquad\text{so}\qquad B^{-1/2}AB^{-1/2} \approx U\begin{pmatrix}0 & M \\ M^\top & 0\end{pmatrix}U^\top,
\]
where
\[
M = \frac{1}{n}\left(\kappa_1 I + \frac{1}{n}\Lambda_1^2\right)^{-1/2}\Lambda_1 U_1^\top U_2\Lambda_2\left(\kappa_2 I + \frac{1}{n}\Lambda_2^2\right)^{-1/2} = \frac{1}{n}\rho_{\kappa_1}(\Lambda_1)\Lambda_1 U_1^\top U_2\Lambda_2\rho_{\kappa_2}(\Lambda_2), \qquad \rho_{\kappa}(x) = \frac{1}{\sqrt{\kappa + x^2/n}},
\]
with $\rho_\kappa$ applied elementwise to the diagonal. Hence, we only need to compute the leading eigenvector (say, $z^\star$) of $\begin{pmatrix}0 & M \\ M^\top & 0\end{pmatrix}$. As for COCO, we then need to normalize the resulting coefficient vectors so that $\|\tilde K^{1/2}\alpha\|_2 = \|\tilde L^{1/2}\beta\|_2 = 1$; arguing exactly as before (using $\|K - G_1G_1^\top\|_* \le \delta$ and $G_1^\top H = V_1\Lambda_1 U_1^\top$), this can again be done in linear time.
It follows that the leading eigenvector of $B^{-1/2}AB^{-1/2}$ is $Uz^\star$, and, for the initial generalized eigenvalue problem, an approximate solution (due to the low-rank approximations of the kernel matrices) is given by
\[
B^{-1/2}Uz^\star = \begin{pmatrix}U_1\rho_{\kappa_1}(\Lambda_1)\Lambda_1^{-1}z^\star_1 \\ U_2\rho_{\kappa_2}(\Lambda_2)\Lambda_2^{-1}z^\star_2\end{pmatrix}.
\]
Given an initial vector of coefficients $\alpha_0$ and $\beta_0$, the vectors can be normalized in linear time using
\[
\alpha = \frac{\alpha_0}{\sqrt{\|G_1^\top H\alpha_0\|_2^2 + \delta\|H\alpha_0\|_2^2}} = \frac{U_1\rho_{\kappa_1}(\Lambda_1)\Lambda_1^{-1}z^\star_1}{\sqrt{\|V_1\rho_{\kappa_1}(\Lambda_1)z^\star_1\|_2^2 + \delta\|H\alpha_0\|_2^2}} = \frac{U_1\rho_{\kappa_1}(\Lambda_1)\Lambda_1^{-1}z^\star_1}{\sqrt{\|\rho_{\kappa_1}(\Lambda_1)z^\star_1\|_2^2 + \delta\|H\alpha_0\|_2^2}},
\]
\[
\beta = \frac{\beta_0}{\sqrt{\|G_2^\top H\beta_0\|_2^2 + \delta\|H\beta_0\|_2^2}} = \frac{U_2\rho_{\kappa_2}(\Lambda_2)\Lambda_2^{-1}z^\star_2}{\sqrt{\|V_2\rho_{\kappa_2}(\Lambda_2)z^\star_2\|_2^2 + \delta\|H\beta_0\|_2^2}} = \frac{U_2\rho_{\kappa_2}(\Lambda_2)\Lambda_2^{-1}z^\star_2}{\sqrt{\|\rho_{\kappa_2}(\Lambda_2)z^\star_2\|_2^2 + \delta\|H\beta_0\|_2^2}}.
\]

F.2. Linear-time Updates of the HSIC Payoff Function

Suppose that we want to evaluate the HSIC payoff function on the next pair of points $(X_{2t+1}, Y_{2t+1}), (X_{2t+2}, Y_{2t+2})$. In order to do so, we need to evaluate the witness estimate at the new points, e.g., $\hat g_{t+1}(X_{2t+2}, Y_{2t+2})$. It is clear that the computational cost of evaluating $\hat\mu_{XY}(x, y)$ and $(\hat\mu_X\otimes\hat\mu_Y)(x, y)$ at a given pair $(x, y)$ is linear in $t$. However, we also need to compute the normalization constant
\[
\left\|\hat\mu_{XY} - \hat\mu_X\otimes\hat\mu_Y\right\|_{\mathcal{G}\otimes\mathcal{H}}. \quad (49)
\]
Recall that
\[
\left\|\hat\mu^{(t)}_{XY} - \hat\mu^{(t)}_X\otimes\hat\mu^{(t)}_Y\right\|^2_{\mathcal{G}\otimes\mathcal{H}} = \frac{1}{(2t)^2}\,\mathrm{tr}\left(K^{(t)}H^{(t)}L^{(t)}H^{(t)}\right),
\]
where $K^{(t)}$ and $L^{(t)}$ are the kernel matrices corresponding to the first $2t$ observations and $H^{(t)} := I_{2t} - \frac{1}{2t}\mathbf{1}_{2t}\mathbf{1}_{2t}^\top$. Instead of computing the normalization constant naively, we next establish a more efficient way of computing (49), linear in $t$, by caching certain values. Introduce
\[
\Delta^{(t)}_1 = \sum_{i,j=1}^{2t} K_{ij}L_{ij} = \mathrm{tr}\left(K^{(t)}L^{(t)}\right), \quad \Delta^{(t)}_2 = \sum_{i,j=1}^{2t} K_{ij} = \mathbf{1}_{2t}^\top K^{(t)}\mathbf{1}_{2t}, \quad \Delta^{(t)}_3 = \sum_{i,j=1}^{2t} L_{ij} = \mathbf{1}_{2t}^\top L^{(t)}\mathbf{1}_{2t}, \quad \Delta^{(t)}_4 = \sum_{i,j,q=1}^{2t} K_{ij}L_{iq} = \mathbf{1}_{2t}^\top K^{(t)}L^{(t)}\mathbf{1}_{2t}.
\]
We have
\[
\left\|\hat\mu^{(t+1)}_{XY} - \hat\mu^{(t+1)}_X\otimes\hat\mu^{(t+1)}_Y\right\|^2_{\mathcal{G}\otimes\mathcal{H}} = \frac{\Delta^{(t+1)}_1}{(2t+2)^2} + \frac{\Delta^{(t+1)}_2\Delta^{(t+1)}_3}{(2t+2)^4} - \frac{2\Delta^{(t+1)}_4}{(2t+2)^3}.
\]
Next, we show how to speed up the computations by caching certain intermediate values. The kernel matrices have the following block structure:
\[
K^{(t+1)} = \begin{pmatrix}K^{(t)} & K_{\cdot,2t+1} & K_{\cdot,2t+2} \\ K_{\cdot,2t+1}^\top & K_{2t+1,2t+1} & K_{2t+1,2t+2} \\ K_{\cdot,2t+2}^\top & K_{2t+2,2t+1} & K_{2t+2,2t+2}\end{pmatrix}, \qquad L^{(t+1)} = \begin{pmatrix}L^{(t)} & L_{\cdot,2t+1} & L_{\cdot,2t+2} \\ L_{\cdot,2t+1}^\top & L_{2t+1,2t+1} & L_{2t+1,2t+2} \\ L_{\cdot,2t+2}^\top & L_{2t+2,2t+1} & L_{2t+2,2t+2}\end{pmatrix},
\]
where $K_{\cdot,2t+1}, K_{\cdot,2t+2}, L_{\cdot,2t+1}, L_{\cdot,2t+2}\in\mathbb{R}^{2t}$ contain the kernel evaluations
\[
K_{\cdot,m} = \begin{pmatrix}k(X_1, X_m) \\ \vdots \\ k(X_{2t}, X_m)\end{pmatrix}, \qquad L_{\cdot,m} = \begin{pmatrix}l(Y_1, Y_m) \\ \vdots \\ l(Y_{2t}, Y_m)\end{pmatrix}, \qquad m\in\{2t+1, 2t+2\}.
\]
First, it is easy to derive that
\[
\mathrm{tr}\left(K^{(t+1)}L^{(t+1)}\right) = \mathrm{tr}\left(K^{(t)}L^{(t)}\right) + 2\left(L_{\cdot,2t+1}^\top K_{\cdot,2t+1}\right) + 2\left(L_{\cdot,2t+2}^\top K_{\cdot,2t+2}\right) + K_{2t+1,2t+1}L_{2t+1,2t+1} + K_{2t+2,2t+2}L_{2t+2,2t+2} + K_{2t+1,2t+2}L_{2t+2,2t+1} + K_{2t+2,2t+1}L_{2t+1,2t+2}.
\]
Thus, if the value $\mathrm{tr}(K^{(t)}L^{(t)})$ is cached, then $\mathrm{tr}(K^{(t+1)}L^{(t+1)})$ can be computed in linear time. Note that
\[
K^{(t+1)}\mathbf{1}_{2t+2} = \begin{pmatrix}K^{(t)}\mathbf{1}_{2t} + K_{\cdot,2t+1} + K_{\cdot,2t+2} \\ K_{\cdot,2t+1}^\top\mathbf{1}_{2t} + K_{2t+1,2t+1} + K_{2t+1,2t+2} \\ K_{\cdot,2t+2}^\top\mathbf{1}_{2t} + K_{2t+2,2t+1} + K_{2t+2,2t+2}\end{pmatrix},
\]
which can be computed in linear time if $K^{(t)}\mathbf{1}_{2t}$ is stored (a similar result holds for $L^{(t+1)}\mathbf{1}_{2t+2}$). It thus follows that $\mathbf{1}_{2t+2}^\top K^{(t+1)}\mathbf{1}_{2t+2}$, $\mathbf{1}_{2t+2}^\top L^{(t+1)}\mathbf{1}_{2t+2}$, and $\mathbf{1}_{2t+2}^\top K^{(t+1)}L^{(t+1)}\mathbf{1}_{2t+2} = \left(K^{(t+1)}\mathbf{1}_{2t+2}\right)^\top\left(L^{(t+1)}\mathbf{1}_{2t+2}\right)$ can all be computed in linear time. To sum up, we need to cache $\mathrm{tr}(K^{(t)}L^{(t)})$, $K^{(t)}\mathbf{1}_{2t}$, and $L^{(t)}\mathbf{1}_{2t}$ in order to compute the normalization constant in linear time.
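A minimal sketch of this caching scheme follows, assuming scalar kernel functions and adding one observation at a time (the paper adds two per round, which works the same way); the class and method names are hypothetical.

```python
import numpy as np

class HsicNormCache:
    """Caches tr(K L), K 1 and L 1 so that the squared normalization constant
    tr(KL)/n^2 + (1'K1)(1'L1)/n^4 - 2(1'KL1)/n^3 is refreshed in O(n) per point."""

    def __init__(self, k, l):
        self.k, self.l = k, l           # scalar kernel functions k(x, x'), l(y, y')
        self.xs, self.ys = [], []
        self.trKL = 0.0
        self.K1 = np.zeros(0)           # row sums of K
        self.L1 = np.zeros(0)           # row sums of L

    def add(self, x, y):
        kx = np.array([self.k(xi, x) for xi in self.xs])   # new column of K, O(n)
        ly = np.array([self.l(yi, y) for yi in self.ys])
        kxx, lyy = self.k(x, x), self.l(y, y)
        self.trKL += 2.0 * float(kx @ ly) + kxx * lyy
        self.K1 = np.append(self.K1 + kx, kx.sum() + kxx)
        self.L1 = np.append(self.L1 + ly, ly.sum() + lyy)
        self.xs.append(x)
        self.ys.append(y)

    def sq_norm(self):
        n = len(self.xs)
        one_KL_one = float(self.K1 @ self.L1)               # 1'KL1 = (K1)'(L1)
        return (self.trKL / n**2
                + self.K1.sum() * self.L1.sum() / n**4
                - 2.0 * one_KL_one / n**3)
```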