Journal of Machine Learning Research 26 (2025) 1-66. Submitted 4/24; Revised 10/25; Published 11/25.

Collaborative likelihood-ratio estimation over graphs

Alejandro de la Concha (alejandro.delaconchaduarte@ens-paris-saclay.fr)
Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, 91190 Gif-sur-Yvette, France
Department of Mathematics, University of Luxembourg, 4364 Esch-sur-Alzette, Luxembourg

Nicolas Vayatis (nicolas.vayatis@ens-paris-saclay.fr)
Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, 91190 Gif-sur-Yvette, France

Argyris Kalogeratos (argyris.kalogeratos@ens-paris-saclay.fr)
Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, 91190 Gif-sur-Yvette, France

Editor: Krishnakumar Balasubramanian

This paper introduces the Collaborative Likelihood-Ratio Estimation problem, which is relevant for applications involving multiple statistical estimation tasks that can be mapped to the nodes of a fixed graph expressing pairwise task similarity. Each graph node v observes i.i.d. data from two unknown node-specific pdfs, p_v and q_v, and the goal is to estimate the likelihood-ratios (or density-ratios), r_v(x) = q_v(x)/p_v(x), for all v. Our contribution is multifold: we present a non-parametric collaborative framework that leverages the graph structure of the problem to solve the tasks more efficiently; we present a concrete method that we call Graph-based Relative Unconstrained Least-Squares Importance Fitting (GRULSIF), along with an efficient implementation; and we derive convergence rates that highlight the role of the main variables of the problem. Our theoretical results make explicit the conditions under which collaborative estimation leads to performance gains compared to solving each estimation task independently.
Finally, in a series of experiments, we demonstrate that the joint likelihood-ratio estimation of GRULSIF at all graph nodes is more accurate than state-of-the-art methods that operate independently at each node, and we verify that the behavior of GRULSIF is in agreement with our theoretical analysis.

Keywords: Unsupervised learning, f-divergence, likelihood-ratio estimation, kernel methods, graph regularization, multitask learning.

1. Introduction

The quantification and statistical validation of the difference between two probability measures, P and Q, is a fundamental problem in Machine Learning and Statistics. The likelihood-ratio between the associated pdfs, r(x) = q(x)/p(x), can serve this purpose and has been used in designing statistical tests and detectors (Tartakovsky et al., 2014; Lehmann and Romano, 2022). In many practical cases where p and q are unknown and only sampled data are available from each of them, r(·) needs to be estimated from the available data via non-parametric methods that require minimal assumptions about p and q. A critical component in likelihood-ratio estimation (LRE) is the variational representation of f-divergences (φ-divergences), which shows that measuring the dissimilarity between P and Q, that is D_φ(P‖Q), is equivalent to solving a functional optimization problem (Nguyen et al., 2008, 2010; Sugiyama et al., 2012; Agrawal and Horel, 2021; Birrell et al., 2022a).

©2025 Alejandro de la Concha, Nicolas Vayatis, Argyris Kalogeratos. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-0565.html.

Figure 1: Collaborative LRE over a graph.
Given observations from two probabilistic models p_v (blue) and q_v (pink) at each node v of a graph (left), the proposed Collaborative LRE framework aims at estimating jointly all the N associated likelihood-ratios r_v (right) in a collaborative and distributed manner. This example shows how, using only the input data X ⊂ R², a vector-valued function r(·) = (r_1(·), ..., r_N(·)) holding the likelihood-ratios is derived.

Interestingly, under some conditions on P, Q, and D_φ(P‖Q), the solution to this optimization problem leads to an approximation of the likelihood-ratio. Existing research has focused on the single LRE problem; however, modern challenges arise when multiple entities need to solve similar problems using local data samples. For instance, each entity may correspond to an agency or sensor collecting local data for a different geographic area, e.g. meteorological data, air-pollution sensor networks, or medical surveys across geographic regions. One's interest may then be to estimate a likelihood-ratio at each location, which can subsequently be used in application tasks. In such settings, Collaborative LRE could play a crucial role by ensuring that the heterogeneity among what each entity observes locally does not get diluted by any sort of global data aggregation. Under certain conditions, this approach can enhance the estimation of the likelihood-ratios associated with each entity, and hence the performance in the application task of interest, compared to ignoring the graph structure of the problem. Therefore, the intriguing question we put forward in this work is how these entities can collaborate to solve their LRE tasks better than operating independently on their own.

Contribution. This work introduces the Collaborative LRE problem for multiple data sources: each data source v, represented as a node in a fixed graph, intends to compare two node-specific pdfs, p_v and q_v, using local i.i.d. observations.
The novelty lies in the collaborative estimation of the likelihood-ratios, r_v(x) = q_v(x)/p_v(x), instead of estimating them independently at each node. Our fundamental hypothesis is that the graph structure conveys valuable information about how similar the estimation tasks at any two nodes are expected to be. A visual summary of the problem is provided in Fig. 1. We also present an algorithm for the Collaborative LRE problem, called Graph-based Relative Unconstrained Least-Squares Importance Fitting (GRULSIF). As in previous works on LRE, GRULSIF is defined in terms of the variational representation of φ-divergences (Nguyen et al., 2008, 2010). In this graph-based extension, where multiple likelihood-ratios need to be estimated, we consider a functional space of vector-valued functions whose geometry encodes the node-similarity hypothesis. Our framework assumes that the likelihood-ratios we aim to estimate, {r_v}_{v∈V}, are elements of a Reproducing Kernel Hilbert Space (RKHS) H shared among all nodes, and that at two adjacent nodes u and v the likelihood-ratios r_u and r_v will be close to each other in H. This graph-based setting has the following characteristics:

1. The Collaborative LRE is an instance of Multitask Learning, which has been shown to improve generalization in supervised problems (Maurer, 2006a,b; Yousefi et al., 2018; Zhang and Yang, 2021) and has motivated special and often distributed optimization schemes (Nassif et al., 2020a,b,c; Zhang and Yang, 2021).

2. The GRULSIF method is characterized by convergence rates that are derived thanks also to available theoretical results in the literature. Moreover, we identify under which conditions the Collaborative LRE outperforms the independent LRE, as well as the sensitivity of the former to important problem variables, such as the amount of available data per node, the number of nodes, and whether prior information is provided in the form of a graph structure.

3.
Our numerical implementation exploits the graph structure of the problem and a shared RKHS for all nodes, and scales well with the number of nodes. GRULSIF also provides guidelines for hyperparameter selection and for assessing its sensitivity to different choices.

LRE applications and motivation for graph-based extensions. The interplay between φ-divergences and likelihood-ratios has led to several applications, such as Transfer Learning, Hypothesis Testing, and Change-point Detection. Transfer Learning relaxes the classical hypothesis that the training and test datasets are samples of the same distribution, and relies instead on importance weighting, which trains a predictive model by focusing on training losses that are weighted according to the test-over-training likelihood-ratio, r(x) = q_test(x)/p_train(x) (Huang et al., 2006; Sugiyama et al., 2007; Yamada et al., 2013; Lu et al., 2023). In Hypothesis Testing, statistical tests based on φ-divergences are suitable when the forms of q and p are unknown and only data samples are available. The test statistic takes the form of an approximated φ-divergence via empirical averages over estimated likelihood-ratios (Sugiyama et al., 2011b, 2012; Yamada et al., 2013). Similarly, in non-parametric Change-point Detection the goal is to detect the moment at which a time-series changes behavior from p to q, both of which are unknown (Liu et al., 2013; Ferrari et al., 2023). The workflow in such applications comprises two stages: the likelihood-ratio is first estimated, and then it is used to compute application-specific scores, e.g. a weighting function or a test statistic. This explains the broad interest of the research community in generic LRE approaches that can implement the first of the above stages.
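As a concrete illustration of this two-stage workflow, the sketch below plugs a likelihood-ratio into an empirical Pearson-divergence score of the kind used as a two-sample test statistic. The ratio here is the closed-form ratio of two known 1-D Gaussians, standing in for the output of stage one; all distributions and parameters are hypothetical, chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (stand-in): a likelihood-ratio r_hat = q/p. Here it is the
# exact ratio of two known 1-D Gaussians; in practice r_hat would come
# from a non-parametric LRE method.
def r_hat(x, mu_p=0.0, mu_q=0.5, s=1.0):
    return np.exp(((x - mu_p) ** 2 - (x - mu_q) ** 2) / (2 * s ** 2))

# Stage 2: an application-specific score, e.g. the empirical version of
# the Pearson divergence E_P[(r(x) - 1)^2] / 2, used as a test statistic.
x_p = rng.normal(0.0, 1.0, 5000)               # sample from p
stat = 0.5 * np.mean((r_hat(x_p) - 1.0) ** 2)
print(stat > 0.05)                             # clearly nonzero: p != q
```

Replacing `r_hat` by the output of any generic LRE method leaves stage two unchanged, which is exactly why generic ratio estimators are reusable across applications.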
The graph-based counterpart of this problem, introduced in this work, is motivated by recent technological advances that have increased the capacity to monitor many aspects of daily life or natural phenomena by collecting and analyzing data coming from multiple sources. Imagine, for example, a network of sensors monitoring the air quality of a city, where sensors that are close to each other are likely to record similar signals. Using the data from each sensor and the similarities between them, one could design a two-sample test based on likelihood-ratio estimation to detect pollution peaks. In another example, the similarity between hospitals in terms of the patient profiles they receive can be exploited by graph-based LRE to enable Transfer Learning and update their diagnostic algorithms in response to changes in the behavior of a disease. The emergence of such systems, made of multiple sensors or agents each generating their own data, poses the intriguing question of how to adapt LRE approaches so that they integrate this heterogeneity while enabling collaboration. This challenge was already mentioned in Sugiyama et al. (2012) (chapter: "Conclusions and Future Directions"), where the need for such methods was acknowledged so that likelihood-ratio-based techniques can be applied to real-world problems involving, e.g., sensory data, finance, and neuroscience. Besides, the Multitask Learning and Federated Learning literature acknowledges graph-based modeling as a promising research direction to address within their scope. The underlying hypothesis is that agents share similarities that can be modeled via graph-based techniques, and those similarities can then be leveraged to design more efficient machine learning methods. This work is in this direction, proposing the GRULSIF framework that can enable applications such as the aforementioned.
Worth noting, this framework has already been employed, in works developed in parallel, as a foundation for Change-point Detection and Two-Sample Testing that account for graph-structured data (de la Concha et al., 2023, 2025).

Organization of the paper. Sec. 2 introduces the Collaborative LRE problem and provides the building elements of our framework. Sec. 3 presents the main technical contribution of the paper, the GRULSIF method. Sec. 4 provides theoretical guarantees on GRULSIF's excess risk. Sec. 5 discusses the elements allowing an efficient GRULSIF implementation. Sec. 6 illustrates the performance of GRULSIF in experiments on synthetic and real-life data. Our concluding remarks follow in Sec. 7. Technical details are in the Appendix.

2. Problem definition and background

In this section, we start by giving general notations and a problem statement. Then, we provide a brief background for each of the diverse technical elements we combine in this work, while at the same time justifying a series of choices that have been made.

General notations. Let a_i be the i-th entry of a vector a; when the vector is itself indexed by j, we refer to its i-th entry by a_{j,i}. A_{ij} denotes the entry at the i-th row and j-th column of a matrix A, and A_{i,:} is its i-th row. 1_M and 0_M represent vectors with M ones and zeros, respectively, and I_M is the M×M identity matrix. The Euclidean norm and dot product appear as ‖·‖ and ⟨·,·⟩, and when referring to a functional space H they are indexed as ‖·‖_H and ⟨·,·⟩_H. We consider a fixed, undirected, and positively weighted graph G = (V, E, W), without self-loops, defined by the set V containing N nodes, the set of edges E, and a non-negative weight matrix W ∈ R^{N×N} such that W_{uu} = 0 for all u ∈ V, and W_{uv} = W_{vu} ≥ 0. The set containing the neighbors of node v is ng(v) = {u : W_{uv} ≠ 0}, and the node degree is d_v = |ng(v)| = Σ_u 1{W_{uv} ≠ 0}, where |·| ∈ N denotes the cardinality of a set, and 1{condition} ∈ {0,1} is the indicator function.
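To make the graph notation concrete, here is a short sketch on a hypothetical 4-node weighted graph, computing the neighborhoods ng(v), the degrees, and the combinatorial Laplacian defined next in the text. For a weighted graph we use the weighted degree Σ_u W_uv in the Laplacian, which coincides with |ng(v)| when the weights are binary.

```python
import numpy as np

# Hypothetical weighted graph on N = 4 nodes: symmetric W, zero diagonal.
W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.0, 2.0],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 2.0, 1.0, 0.0]])
assert np.allclose(W, W.T) and not np.any(np.diag(W))

# Neighborhoods ng(v) = {u : W_uv != 0} and degrees d_v = |ng(v)|.
ng = {v: set(np.flatnonzero(W[v]).tolist()) for v in range(4)}
d = np.array([len(ng[v]) for v in range(4)])

# Combinatorial Laplacian L = diag(degrees) - W; with edge weights, the
# weighted degree sum_u W_uv is the standard choice (equal to |ng(v)|
# for binary weights).
L = np.diag(W.sum(axis=1)) - W
print(d, np.allclose(L @ np.ones(4), 0))  # constant signals lie in ker(L)
```

The check `L @ np.ones(4)` returning zero reflects the defining property of the combinatorial Laplacian: constant graph signals are perfectly smooth.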
The combinatorial graph Laplacian operator is defined by L = diag((d_v)_{v∈V}) − W, where diag(·) is a diagonal matrix with the input elements on its diagonal. In the rest, composite objects (vectors, matrices, sets, etc.) referring to all the nodes of a graph appear in bold font.

2.1 Collaborative likelihood-ratio estimation (LRE): Problem statement

In a given fixed, undirected, and positively weighted graph G = (V, E, W), suppose each node v has independent and identically distributed (i.i.d.) observations from two unknown probability measures P_v and Q_v (see the formal Assumption 2): n_v observations from P_v and, respectively, n′_v others from Q_v. Let the measures P_v and Q_v admit the pdfs p_v and q_v with respect to (w.r.t.) a reference measure ρ over a common input space X; then, the available observations are defined as:

$$
\begin{cases}
\mathbf{X} = \{X_v\}_{v\in V} = \big\{\{x_{v,1},\dots,x_{v,n_v}\}\big\}_{v\in V}, & \forall v,i:\ x_{v,i} \overset{\text{i.i.d.}}{\sim} p_v,\ x_{v,i}\in\mathcal{X};\\
\mathbf{X}' = \{X'_v\}_{v\in V} = \big\{\{x'_{v,1},\dots,x'_{v,n'_v}\}\big\}_{v\in V}, & \forall v,i:\ x'_{v,i} \overset{\text{i.i.d.}}{\sim} q_v,\ x'_{v,i}\in\mathcal{X}.
\end{cases} \tag{1}
$$

Consider an LRE task at each node v, quantifying how different p_v and q_v are by learning a node-specific model f_v that approximates the likelihood-ratio function. The Collaborative LRE problem aims to learn all the N models jointly by leveraging the similarity between tasks of adjacent nodes (see Fig. 1). Note that, for technical reasons explained in Sec. 2.3 and later in the paper, instead of the typical likelihood-ratio function r_v(·) = q_v(·)/p_v(·), we propose to approximate the α-relative likelihood-ratio function expressed in Eq. 2.

2.2 Likelihood-ratio functions

Let (X, Ξ) be a measurable space. We denote by M⁺_σ(X) the set of positive σ-finite measures over (X, Ξ), M(X) refers to the space of all finite signed measures, and P(X) is the set of probability measures. Any ρ ∈ M(X) admits a unique decomposition as the difference of two positive measures, ρ = ρ⁺ − ρ⁻, where at least one of them is finite (Hahn decomposition theorem).
Then, the total variation measure of ρ ∈ M(X) is defined as |ρ| = ρ⁺ + ρ⁻. For two probability measures P, Q ∈ P(X), the Radon-Nikodym derivative r = dQ/dP exists iff Q ≪ P (consequence of the Radon-Nikodym theorem). When both P and Q admit a pdf, p and q, w.r.t. a reference measure ρ, the Radon-Nikodym derivative is rewritten as r(x) = (dQ/dP)(x) = q(x)/p(x). The quotient q(x)/p(x) is known as the likelihood-ratio or density-ratio. In practice, Q may fail to be absolutely continuous w.r.t. P, which implies that the Radon-Nikodym derivative does not exist and hence the LRE problem is ill-defined. A flexible workaround is the α-regularization, which chooses a value 0 ≤ α ≤ 1 and introduces the convex combination of P and Q as a composite probability measure: P^α = (1−α)P + αQ. An advantage is that setting α > 0 ensures Q ≪ P^α; in addition, when P and Q admit pdfs p and q, the likelihood-ratio is replaced by the α-relative likelihood-ratio function (Yamada et al., 2011), r^α : X → R, indexed by α:

$$
r^\alpha(x) = \frac{q(x)}{(1-\alpha)\,p(x) + \alpha\,q(x)} < \frac{1}{\alpha}, \quad \text{for any } 0 < \alpha < 1. \tag{2}
$$

Notably, when α > 0, the ratio r^α is always bounded above by 1/α and does not suffer from the numerical problems typically faced when trying to approximate an unbounded function.

2.3 Likelihood-ratio estimation with φ-divergences

φ-divergence. φ-divergences offer a unified view of various statistical problems, as they can be used both for quantifying the dissimilarity between two probability measures P and Q and for training probabilistic models. In the LRE context, variational representations of φ-divergences have been used to define valid surrogate cost functions, such that estimating the likelihood-ratio amounts to solving an optimization problem defined in a functional space F. This has the advantage of minimizing the required assumptions on P and Q, and allows the direct approximation of the likelihood-ratio.
In most applications, for efficient estimation it is more intuitive to impose regularity conditions on the likelihood-ratio via the geometry of the functional space F, rather than specifying P and Q explicitly. Formally, a φ-divergence is a positive measure of the dissimilarity between two probability measures P and Q defined over an input space X:

$$
D_\varphi(P\,\|\,Q) =
\begin{cases}
\displaystyle\int \varphi\!\left(\frac{dQ}{dP}(x)\right) dP(x), & \text{if } Q \ll P;\\[4pt]
+\infty, & \text{otherwise,}
\end{cases} \tag{3}
$$

where φ : R → R is a proper closed convex function from (−∞, ∞) to [0, ∞] with φ(1) = 0, and such that its domain Dom(φ) = {x ∈ R | φ(x) < ∞} is an interval with endpoints a_φ < 1 < b_φ (Csiszár, 1967; Broniatowski and Keziou, 2006; Birrell et al., 2022a). D_φ(P‖Q) ≥ 0 always holds and, if φ is strictly convex at 1, then D_φ(P‖Q) = 0 iff P = Q. Notably, for z ∈ R, setting φ(z) = −log(z) recovers the KL-divergence (Kullback, 1959), while φ(z) = ½(z−1)² recovers Pearson's χ²-divergence (Pearson, 1900).

Variational representation of φ-divergences. Deriving valid variational representations is an active research topic, and there are several formulations built using different assumptions (Broniatowski and Keziou, 2006; Nguyen et al., 2008, 2010; Agrawal and Horel, 2021; Birrell et al., 2022a,b; Bach, 2024). In this work, we focus on the following result.

Theorem 1 (Theorem 4.3 in Broniatowski and Keziou (2006)). Let F be some class of measurable real-valued functions defined on X, B the set of all bounded measurable real-valued functions defined on X, and span(F ∪ B) the set of all linear combinations of elements of F ∪ B; let also the set:

$$
M_F = \left\{ Q \in M(\mathcal{X}) : \int |f|\, d|Q| < \infty,\ \forall f \in F \right\}, \tag{4}
$$

where |Q| denotes the total variation of the signed finite measure Q. Assume that φ is differentiable. Then, for all Q ∈ M_F such that D_φ(P‖Q) is finite and φ′(dQ/dP) belongs to span(F ∪ B), D_φ(P‖Q) admits the dual representation:

$$
D_\varphi(P\,\|\,Q) = \sup_{g \in \mathrm{span}(F \cup B)} \int g(x')\, dQ(x') - \int \varphi^*(g(x))\, dP(x), \tag{5}
$$

where φ* denotes the convex conjugate of φ, and the function g* = φ′(dQ/dP) is a dual optimal solution.
Furthermore, if φ is essentially smooth, then g* is the unique dual solution P-almost everywhere.

Solving Problem 5 to estimate the φ-divergence depends on the functional space F and on the set defined by the subdifferential of φ, evaluated at the likelihood-ratio r(x) for x ∈ X. This link has been exploited to define convex functional optimization problems that first estimate the likelihood-ratio and then the associated φ-divergence (Nguyen et al., 2008, 2010; Sugiyama et al., 2012). This approach makes no hypothesis about the form of q and p, hence it leads to non-parametric algorithms that only need data observations coming from p and q. Note that we write D^α_φ(P‖Q) := D_φ(P^α‖Q) to define a φ-divergence in the form of Eq. 3, but in terms of the relative likelihood-ratio r^α. Moreover, D^α_φ(P‖Q) is a valid dissimilarity function, since D^α_φ(P‖Q) ≥ 0, and when φ is strictly convex around 1 it can be verified that D^α_φ(P‖Q) = 0 iff P = Q. Then, the result of Theorem 1 links the variational formulation of D^α_φ(P‖Q) and the relative likelihood-ratio r^α. To address the Collaborative LRE problem stated in Sec. 2.1, we define a surrogate cost function based on the variational representation of D^α_φ(P_v‖Q_v) to jointly estimate all {r^α_v}_{v∈V}, i.e. one (relative) likelihood-ratio function r^α_v(x) = q_v(x)/((1−α)p_v(x) + αq_v(x)) ∈ R for each node, over a vector-valued functional space G. We choose a normed vector space G whose norm encodes the given information regarding the similarity between the graph nodes, which in this case amounts to the similarity between their likelihood-ratio functions {r^α_v}_{v∈V}.

2.4 Non-parametric estimation

LRE can be addressed using different functional spaces, e.g.
Neural Networks (Nowozin et al., 2016; Rhodes et al., 2020; Kato and Teshima, 2021) or Reproducing Kernel Hilbert Spaces (RKHSs) (Sugiyama et al., 2007; Nguyen et al., 2010; Yamada et al., 2013; Kanamori et al., 2011; Zellinger et al., 2023; de la Concha et al., 2024; Nguyen et al., 2024). The choice depends on the availability of data, the computational resources, or other prior information about the likelihood-ratio. In this work, we focus on RKHSs as they offer numerous comparative advantages: they provide geometrical operations defined in Hilbert spaces that facilitate the estimation and the theoretical analysis; they allow us to learn in rich infinite-dimensional spaces; and the complexity of the approximated function can be elegantly expressed by the norm ‖·‖_H. Traditional Kernel Methods model scalar functions in an RKHS associated with a positive definite kernel. In our context, we would like to approximate the vector-valued function r^α(·) = (r^α_1(·), ..., r^α_N(·)) ∈ H^N, where each dimension is associated with a node of the graph. Moreover, we would like the functional space to be rich enough to approximate all those N likelihood-ratios. This is possible via a Vector-Valued Reproducing Kernel Hilbert Space (VV-RKHS), which is a multivariate generalization of the scalar RKHS (Micchelli and Pontil, 2005; Carmeli et al., 2006; Alvarez et al., 2012). Formal definitions follow.

Scalar RKHS. Let X be a set, and H ⊂ R^X a class of functions forming a real Hilbert space with inner product ⟨·,·⟩_H. K : X × X → R is a reproducing kernel function of H if:

1. H contains all functions of the form: ∀x ∈ X, K(x,·) : t ↦ K(x,t).
2. For every x ∈ X and f ∈ H, the reproducing property holds: f(x) = ⟨f, K(x,·)⟩_H.

If a reproducing kernel K exists, then H is called an RKHS. An RKHS has a unique reproducing kernel and, conversely, a kernel function K describes at most one RKHS.

Vector-valued kernels and associated RKHS.
A positive vector-valued kernel in R^N on X × X is a map Γ : X × X → R^{N×N} such that, for all n ∈ N, x_1, ..., x_n ∈ X, and c_1, ..., c_n ∈ R^N:

$$
\sum_{i,j=1}^{n} \langle c_i,\, \Gamma(x_i, x_j)\, c_j \rangle \ \geq\ 0. \tag{6}
$$

As in the scalar case, a positive vector-valued kernel is associated with a unique vector-valued RKHS (VV-RKHS), denoted by G (see Theorem 1 in Micchelli and Pontil (2005)). However, the conditions characterizing the associated functional space are different in this case (Micchelli and Pontil, 2005; Carmeli et al., 2006); more precisely, it is required that:

1. For every c ∈ R^N and x ∈ X: Γ(x,·)c ∈ G.
2. For every f ∈ G: ⟨f, Γ(x,·)c⟩_G = ⟨f(x), c⟩.

In this case, the reproducing kernel function returns a matrix in R^{N×N} (instead of a scalar), and the elements of the associated Hilbert space are vector-valued functions f : X → R^N.

2.5 Graph functions and graph smoothness

A graph function ϑ : V → Y assigns to each node of a graph an element of a given metric space Y. When Y = R, ϑ = (ϑ_1, ..., ϑ_N) ∈ R^N is also called a graph signal (Shuman et al., 2013). The smoothness of ϑ w.r.t. a graph is defined as:

$$
S(\vartheta) = \sum_{v \in V} \sum_{u \in ng(v)} W_{uv}\, (\vartheta(u) - \vartheta(v))^2. \tag{7}
$$

For the needs of our discussion, we introduce a generalization of this notion for Y = H, formally for graph functions ϑ(·) = (ϑ_1(·), ..., ϑ_N(·)) ∈ H^N, which implies ϑ_v(·) ∈ H, ∀v ∈ V:

$$
S(\vartheta(\cdot)) = \sum_{v \in V} \sum_{u \in ng(v)} W_{uv}\, \|\vartheta_u(\cdot) - \vartheta_v(\cdot)\|_H^2. \tag{8}
$$

The lower S(ϑ(·)) is, the smoother we say the function ϑ(·) is w.r.t. the graph. Smoothness formalizes the idea that two adjacent nodes u and v have similar functions ϑ_u(·), ϑ_v(·). Notice that, for any x ∈ X, ϑ(x) = (ϑ_1(x), ..., ϑ_N(x)) ∈ R^N gives a graph signal. Moreover, when H is a scalar RKHS with a bounded kernel, i.e. ∃C > 0 s.t. sup_{x∈X} √K(x,x) ≤ C < ∞, the smoothness of the graph function ϑ(·) w.r.t. the norm ‖·‖_H (Expr. 8) also implies the smoothness of the graph signal ϑ(x) in the classical sense (Expr. 7).
More clearly, for any x ∈ X:

$$
\begin{aligned}
(\vartheta_u(x) - \vartheta_v(x))^2 &= \big|\langle \vartheta_u(\cdot) - \vartheta_v(\cdot),\, K(x,\cdot) \rangle_H\big|^2 && \text{(reproducing property of } H\text{)}\\
&\leq \|K(x,\cdot)\|_H^2\, \|\vartheta_u(\cdot) - \vartheta_v(\cdot)\|_H^2 && \text{(Cauchy-Schwarz inequality)}\\
&\leq C^2\, \|\vartheta_u(\cdot) - \vartheta_v(\cdot)\|_H^2, && \text{(kernel boundedness)} \tag{9}
\end{aligned}
$$

and therefore:

$$
S(\vartheta(x)) = \sum_{v \in V} \sum_{u \in ng(v)} W_{uv}\, (\vartheta_u(x) - \vartheta_v(x))^2 \ \leq\ C^2 \sum_{v \in V} \sum_{u \in ng(v)} W_{uv}\, \|\vartheta_u(\cdot) - \vartheta_v(\cdot)\|_H^2 = C^2\, S(\vartheta(\cdot)). \tag{10}
$$

Further comments on what graph smoothness quantifies in a Multitask Learning context are provided in Sec. 3B.

3. Graph-based Relative Unconstrained Least-Squares Importance Fitting (GRULSIF)

The proposed Collaborative LRE framework estimates jointly the N likelihood-ratios at the nodes of a graph (see Fig. 1), in a collaborative and distributed manner. We approximate each node's r^α_v with a function f_v ∈ H. Note that r^α = (r^α_1, ..., r^α_N) defines a graph function, which we assume to be smooth over the graph (see Sec. 2.5). This essentially suggests that two adjacent nodes, u and v, generally exhibit similar likelihood-ratios, r^α_u and r^α_v. This in turn means that the learned models, f_u and f_v, will be close w.r.t. the norm ‖·‖_H and, as a consequence, will have similar outputs f_u(x) and f_v(x) for an input point x ∈ X. Notice that, by the likelihood-ratio definition, this hypothesis holds when p_v = q_v for all v, even if there is heterogeneity among the nodes (generally, p_v ≠ p_u). The latter is the basis of our approach, which takes inspiration from the RULSIF method of Yamada et al. (2011), and is hence named Graph-based Relative Unconstrained Least-Squares Importance Fitting (GRULSIF). Our estimation strategy capitalizes on the elements described in Sec. 2 as follows:

1. The variational representation of Theorem 1 will allow us to define a functional optimization problem at the node level, aiming to approximate the relative likelihood-ratio r^α_v, while requiring minimal hypotheses on {p_v}_{v∈V} and {q_v}_{v∈V}.

2. The concepts of graph smoothness and VV-RKHS will encode the geometry of the problem, and will formalize a collaborative estimation procedure.

3.
The properties of the VV-RKHS, more precisely the Representer theorem, will provide the required elements to translate the optimization problem from a potentially infinite-dimensional space into a simple optimization problem in R^L, where L is the total number of available observations from all the nodes. Moreover, this approach will lead to efficient likelihood-ratio estimators, f̂_v, that can be evaluated at any point x ∈ X by just computing a dot product in R^L.

This line of reasoning is general enough to be applicable to any φ-divergence to produce likelihood-ratio estimates that account for a graph structure. However, for the rest of the paper, we will focus on Pearson's χ²-divergence. The main reason for using this specific φ-divergence is that collaborative LRE then takes the form of an unconstrained penalized least-squares problem. Moreover, the likelihood-ratio estimates are the solution to a linear system. Leveraging these features, we can seamlessly adapt existing and efficient optimization techniques tailored for penalized least-squares, and hence integrate a mature theoretical framework to gain insight into the properties of the estimators. Such advantages may not be offered or be readily available for other φ-divergences.

A) Node-level relative likelihood-ratio estimation

At each node v, we approximate the relative likelihood-ratio r^α_v via the variational representation of the χ²-divergence. To explain the properties of the resulting surrogate cost function, we focus on estimating a single relative likelihood-ratio r^α (i.e. N = 1) by the elements of a scalar RKHS H, equipped with a scalar reproducing kernel K and a feature map ϕ(·) (see Sec. 2.4). To formally define the problem, we require H to satisfy the following standard hypothesis of the Kernel Methods literature.

Assumption 1 (Kernel boundedness). The reproducing kernel map K(·,·) is measurable and can be upper-bounded by a constant C > 0:

$$
\sup_{x \in \mathcal{X}} \sqrt{K(x,x)} \ \leq\ C < \infty. \tag{11}
$$
This is satisfied by popular kernels, such as the Gaussian and the Laplacian kernels, and in general by continuous kernel maps K(·,·) defined on a compact input space X. Theorem 2 states the variational representation of the χ²-divergence when F = H. The result is a direct consequence of Theorem 1, and its proof is provided in Appendix A.

Theorem 2. Under Assumption 1, the variational formulation of the χ²-divergence between P^α and Q takes the form:

$$
\mathrm{PE}(P^\alpha \,\|\, Q) = \int \frac{(r^\alpha(y) - 1)^2}{2}\, dP^\alpha(y) = \sup_{f \in H} \left[ \int f(x')\, dQ(x') - \int \frac{f^2(y)}{2}\, dP^\alpha(y) \right] - \frac{1}{2}, \tag{12}
$$

where the optimal f is an approximation of the likelihood-ratio r^α.

The optimization problem appearing in Expr. 12 can be interpreted as a least-squares approximation of r^α. This becomes more evident in Problem 13, where we state the LRE as a least-squares optimization problem in the RKHS with respect to the norm of the space of square-integrable functions w.r.t. P^α, denoted by L₂(P^α):

$$
\inf_{f \in H} \hat{L}(f) = \inf_{f \in H} \int \frac{f^2(y)}{2}\, dP^\alpha(y) - \int f(x')\, dQ(x')
\ \asymp\ \inf_{f \in H} \int \frac{f^2(y)}{2}\, dP^\alpha(y) - \int f(y)\, r^\alpha(y)\, dP^\alpha(y) + \int \frac{(r^\alpha(y))^2}{2}\, dP^\alpha(y)
= \inf_{f \in H} \int \frac{[f - r^\alpha]^2(y)}{2}\, dP^\alpha(y), \tag{13}
$$

where ≍ indicates that the two optimization problems are equivalent up to constant terms. The second equivalence follows from E_{p^α(y)}[r^α(y)²] = C₁, where C₁ < ∞ is a constant, and from the change-of-measure identity E_{p^α(y)}[g(y) r^α(y)] = E_{q(x')}[g(x')]. For Problem 13 to be well-defined, the optimal solution f* should exist with f* ∈ H, and f* should coincide with the relative likelihood-ratio r^α we want to estimate. To ensure this, notice that Assumption 1 implies H ⊂ L₂(P^α):

$$
\begin{aligned}
E_{p^\alpha(y)}[f^2(y)] &= E_{p^\alpha(y)}\big[\langle f, K(y,\cdot)\rangle_H^2\big] && \text{(reproducing property)}\\
&\leq \|f\|_H^2\, E_{p^\alpha(y)}\big[\|K(y,\cdot)\|_H^2\big] && \text{(Cauchy-Schwarz inequality)}\\
&\leq C^2\, \|f\|_H^2 < \infty. && \text{(Assumption 1)}
\end{aligned}
$$

For a well-specified model, i.e. r^α ∈ H, it is clear that f* = r^α holds P^α-almost everywhere. Moreover, it is worth commenting on whether this model specification constitutes a restrictive hypothesis in the usual LRE context. Notice that the fact that r^α is an upper-bounded function implies r^α ∈ L₂(P^α).
Then, the elements of an RKHS H that is dense w.r.t. the L₂(P^α) norm can approximate r^α as closely as we want. Most LRE methods deal with the case X ⊂ R^d, where it can be shown that translation-invariant kernels (e.g. the Gaussian, Laplacian, and Matérn classes of kernels) are universal kernels that can approximate any function in L₂(P^α). Similar results can be obtained when X is a locally compact Hausdorff space (Sriperumbudur et al., 2011). To conclude, the approximation properties of RKHSs, along with the α-regularization, imply that the proposed Problem 12 is a powerful functional optimization technique for approximating relative likelihood-ratios. Moreover, under the hypothesis r^α ∈ H, the approximation error would be zero.

B) Multitasking formulation of the LRE over graphs

Our goal is to estimate the vector-valued function r^α = (r^α_1, ..., r^α_N). As in the previous section, we need a surrogate cost function leading to an optimization problem with nice numerical properties, and a functional space with good approximation properties. Moreover, we would like to leverage prior knowledge about the similarities between the relative likelihood-ratios to obtain better generalization. Thus, the LRE problem based on the χ²-divergence motivates a multitasking formulation of the relative likelihood-ratio estimation over a graph, through the following objective function:

$$
\begin{aligned}
&\operatorname*{argmin}_{\{f_v\}_{v\in V} \in H^N}\ \frac{1}{N} \sum_{v \in V} \frac{1}{2}\, E_{p^\alpha_v(y)}\big[[r^\alpha_v - f_v]^2(y)\big] + \lambda \sum_{u,v \in V} W_{uv}\, \|f_u - f_v\|_H^2 + \lambda\gamma \sum_{v \in V} \|f_v\|_H^2 \\
=\ &\operatorname*{argmin}_{\{f_v\}_{v\in V} \in H^N}\ \frac{1}{N} \sum_{v \in V} \left( \frac{1}{2}\, E_{p^\alpha_v(y)}\big[f_v^2(y)\big] - E_{q_v(x')}\big[f_v(x')\big] \right) + \lambda \sum_{u,v \in V} W_{uv}\, \|f_u - f_v\|_H^2 + \lambda\gamma \sum_{v \in V} \|f_v\|_H^2. \tag{14}
\end{aligned}
$$

The first term of the objective function is a loss asking for a good approximation at each node; the second term introduces our hypothesis that adjacent nodes are expected to have similar likelihood-ratios; the third term is a penalization aiming to reduce the risk of overfitting (Sheldon, 2008), where λ, γ > 0 are penalization coefficients.
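To fix ideas, the empirical counterpart of the objective in Eq. 14 can be evaluated directly once the f_v are given a finite parametrization. The sketch below uses a hypothetical 3-dimensional explicit feature map and replaces ‖·‖_H by the Euclidean norm on parameters; both are simplifications for illustration only, not the kernel parametrization actually used by GRULSIF.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative hyperparameters (hypothetical values).
alpha, lam, gamma = 0.1, 0.05, 0.1

def phi(x):  # hypothetical explicit feature map, f_v(x) = theta_v . phi(x)
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

N = 3
W = np.array([[0.0, 1.0, 0.0],   # path graph 0 - 1 - 2
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X  = [rng.normal(0.0, 1.0, 50) for _ in range(N)]   # samples from p_v
Xp = [rng.normal(0.3, 1.0, 50) for _ in range(N)]   # samples from q_v
theta = rng.normal(size=(N, 3))

def objective(theta):
    # Node-level fit: (1-a)/2 E_p[f^2] + a/2 E_q[f^2] - E_q[f], per node.
    fit = 0.0
    for v in range(N):
        f_p, f_q = phi(X[v]) @ theta[v], phi(Xp[v]) @ theta[v]
        fit += ((1 - alpha) / 2 * np.mean(f_p ** 2)
                + alpha / 2 * np.mean(f_q ** 2) - np.mean(f_q))
    # Graph-smoothness and ridge penalties (Euclidean stand-in for H-norms).
    smooth = sum(W[u, v] * np.sum((theta[u] - theta[v]) ** 2)
                 for u in range(N) for v in range(N))
    ridge = np.sum(theta ** 2)
    return fit / N + lam * smooth + lam * gamma * ridge

print(np.isfinite(objective(theta)))
```

Minimizing this quadratic function of the parameters (e.g. by gradient descent or by solving its normal equations) mirrors, in a finite-dimensional toy setting, the penalized empirical risk minimization developed next.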
Let us now define the vector-valued kernel:

Γ(x, x′) = K(x, x′)(L + γI_N)^{-1} ∈ R^{N×N}.   (15)

Given the properties of the graph Laplacian L, it can be shown that Γ(·,·) is a positive vector-valued kernel inducing a VV-RKHS G, in which the norm of any f ∈ G is defined as:

‖f‖²_G = Σ_{u,v∈V} W_uv ‖f_u − f_v‖²_H + γ Σ_{v∈V} ‖f_v‖²_H.   (16)

Notice that the norm in G incorporates both the geometry induced by the structure of the graph Laplacian L and the geometry of the scalar RKHS H. As explained in Sec. 2.1, we assume that there is access to two samples, X and X′. Then, the optimization Problem 14 can be written as a penalized empirical risk minimization problem in terms of the elements of G, i.e. the vector-valued functions f = (f_1, ..., f_N):

min_{f∈G} (1/N) Σ_{v∈V} { (1−α)/(2n_v) Σ_{i=1}^{n_v} f²_v(x_{v,i}) + α/(2n′_v) Σ_{i=1}^{n′_v} f²_v(x′_{v,i}) − (1/n′_v) Σ_{i=1}^{n′_v} f_v(x′_{v,i}) } + (λ/2) ‖f‖²_G.   (17)

The VV-RKHS formulation allows us to apply the Representer theorem (Theorem 5 in Micchelli and Pontil (2005)), meaning that the solution to Problem 17 takes the form:

f̂(·) = Σ_{i=1}^{L} Γ(x_i, ·) c_i = Σ_{i=1}^{L} K(x_i, ·)(L + γI_N)^{-1} c_i,   (18)

where L = Σ_{v∈V} (n_v + n′_v) is the total number of observations over all nodes, and c_i ∈ R^N, i = 1, ..., L. The second equality comes from Eq. 15. More specifically, the node-level approximation now takes the form:

f̂_v(·) = Σ_{i=1}^{L} K(·, x_i) [(L + γI_N)^{-1}]_{v,:} c_i = Σ_{i=1}^{L} K(·, x_i) θ_{v,i} = φ(·)ᵀθ_v,   (19)

where, for the second equality, we have defined θ_{v,i} = [(L + γI_N)^{-1}]_{v,:} c_i, and the feature map w.r.t. all observations as the function φ : X → R^L, φ(x) = (K(x, x_1), ..., K(x, x_L))ᵀ ∈ R^L. The last equality uses θ_v = (θ_{v,1}, ..., θ_{v,L})ᵀ ∈ R^L, an abuse of notation that is helpful for the presentation. By the definition of f̂_v(·), its norm in H can be elegantly written as:

‖f̂_v‖²_H = θ_vᵀKθ_v,   (20)

where K ∈ R^{L×L} is the Gram matrix associated with the scalar kernel function K(·,·). We conclude from Expr. 19 that approximating (r^α_1, ..., r^α_N) amounts to estimating the node parameters θ_v, for all v ∈ V.

Further remarks on graph smoothness.
A clearer interpretation of graph smoothness in the multitask setting can be given by developing the penalization term introduced in Expr. 16:

‖f‖²_G = Σ_{u∼v} W_uv ( ‖f_u‖²_H + ‖f_v‖²_H − 2⟨f_v, f_u⟩_H ) + γ Σ_{v∈V} ‖f_v‖²_H
 = Σ_{v∈V} (d_v + γ) ‖f_v‖²_H − 2 Σ_{u∼v} W_uv ⟨f_v, f_u⟩_H.   (21)

By hypothesis, W_uv ≥ 0, which implies that ‖f‖²_G decreases as the dot products ⟨f_v, f_u⟩_H of the functions associated with connected nodes increase. This dot product can be interpreted as a similarity measure between the functions f_u, f_v ∈ H w.r.t. the geometry of the underlying scalar RKHS. These observations imply that a vector-valued function f = (f_1, ..., f_N) becomes smoother w.r.t. the graph as the functions associated to connected nodes become more similar w.r.t. the geometry of the RKHS. The exact meaning of the similarity measure ⟨f_v, f_u⟩_H depends on the chosen kernel. For example, consider the case of the input space X = R^d and a translation-invariant positive definite kernel K(x, t) = ψ(x − t), where ψ and its Fourier transform ψ̂ are integrable functions on R^d. In this setting, each element of the scalar RKHS is an integrable and continuous function f whose norm is given by:

‖f‖²_H = (1/(2π)^d) ∫_{R^d} |f̂(w)|² / ψ̂(w) dw < +∞.   (22)

Then, the similarity ⟨f_v, f_u⟩_H can be computed in terms of the Fourier transforms f̂_u, f̂_v:

⟨f_u, f_v⟩_H = (1/(2π)^d) ∫_{R^d} f̂_u(w) \overline{f̂_v(w)} / ψ̂(w) dw.   (23)

For translation-invariant kernels, ⟨f_v, f_u⟩_H therefore measures the similarity of the two functions in the Fourier domain, weighted by 1/ψ̂(w).

C) LRE as a quadratic problem in R^{NL}

Let Θ = vec(θ_1ᵀ, ..., θ_Nᵀ) ∈ R^{NL} be the concatenation of all node parameter vectors into a single vector. Let us also introduce the following terms associated with a specific feature map (here, φ(·)), which need to be computed only once at the beginning:

H_v = (1/n_v) Σ_{x∈X_v} φ(x)φ(x)ᵀ ∈ R^{L×L},  H′_v = (1/n′_v) Σ_{x′∈X′_v} φ(x′)φ(x′)ᵀ ∈ R^{L×L},  h′_v = (1/n′_v) Σ_{x′∈X′_v} φ(x′) ∈ R^L,
H = block(H_1, ..., H_N) ∈ R^{LN×LN},  H′ = block(H′_1, ..., H′_N) ∈ R^{LN×LN},  h′ = vec(h′_1, h′_2, ..., h′_N) ∈ R^{LN}.
(24)

Above, block(H_1, ..., H_N) denotes a block-diagonal matrix with each block corresponding to one of the square matrices H_1, ..., H_N, and vec(h′_1, ..., h′_N) denotes the concatenation of the input vectors h′_1, ..., h′_N into a single vector. By putting everything together and using Eq. 20 and 24, we conclude that the functional optimization Problem 14 can be restated as a quadratic problem w.r.t. the vector Θ:

min_{Θ∈R^{NL}} Φ(Θ) = min_{Θ∈R^{NL}} (1/N) Σ_{v∈V} { (1−α)/2 θ_vᵀH_vθ_v + α/2 θ_vᵀH′_vθ_v − h′_vᵀθ_v } + (λ/2) Σ_{u,v∈V} W_uv (θ_v − θ_u)ᵀK(θ_v − θ_u) + (λγ/2) Σ_{v∈V} θ_vᵀKθ_v
 = min_{Θ∈R^{NL}} (1/N) { (1/2) Θᵀ((1−α)H + αH′)Θ − h′ᵀΘ } + (λ/2) Θᵀ(I_N ⊗ K^{1/2})ᵀ[L ⊗ I_L](I_N ⊗ K^{1/2})Θ + (λγ/2) Θᵀ(I_N ⊗ K)Θ
 = min_{Θ∈R^{NL}} ΘᵀAΘ − (1/N) h′ᵀΘ,   (25)

where ⊗ is the Kronecker product of two matrices and:

A = (1−α)/(2N) H + α/(2N) H′ + (λ/2)(I_N ⊗ K^{1/2})ᵀ[L ⊗ I_L](I_N ⊗ K^{1/2}) + (λγ/2)(I_N ⊗ K).   (26)

Notice that A is a positive semi-definite matrix, given that L and K are positive semi-definite as well, which implies that Problem 25 is a convex quadratic optimization problem in Θ. We will exploit this fact in Sec. 5 to propose an efficient optimization procedure.

D) Pearson's χ²-divergence estimation

We can use the estimated likelihood-ratio f̂_v from Eq. 19 and Eq. 24 to approximate the following expectation, which corresponds to the loss ℓ_v(θ_v) at node v:

(1/2) E_{p^α_v(y)}[f²_v(y)] − E_{q_v(x′)}[f_v(x′)] = (1−α)/2 E_{p_v(x)}[f²_v(x)] + α/2 E_{q_v(x′)}[f²_v(x′)] − E_{q_v(x′)}[f_v(x′)]   (27)
 ≈ (1−α)/2 θ̂_vᵀH_vθ̂_v + α/2 θ̂_vᵀH′_vθ̂_v − h′_vᵀθ̂_v =: L̂_v(θ̂_v).   (28)

This expression leads to the more compact and convenient formulation of Problem 25:

min_{Θ∈R^{NL}} (1/N) Σ_{v∈V} L̂_v(θ_v) + (λ/2) Θᵀ(I_N ⊗ K^{1/2})ᵀ[L ⊗ I_L](I_N ⊗ K^{1/2})Θ + (λγ/2) Θᵀ(I_N ⊗ K)Θ.   (29)

Moreover, we use Eq. 12 to propose an approximation of PE(p^α_v ∥ q_v) based on the estimated parameters Θ̂ and the available samples X_v and X′_v:

P̂E^α_v(X_v ∥ X′_v) = −L̂_v(θ̂_v) − 1/2.   (30)

Problem 29 and Eq. 30 highlight how minimizing the former amounts to maximizing the estimated χ²-divergence, while at the same time accounting for the structure of the graph and the geometry of the RKHS H.
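Since (I_N ⊗ K^{1/2})ᵀ[L ⊗ I_L](I_N ⊗ K^{1/2}) = L ⊗ K for symmetric PSD K, by the mixed-product property of the Kronecker product, the matrix A of Eq. 26 can be assembled directly. A small sketch (the helper name is ours, and the H, H′ arguments are the per-node moment blocks of Eq. 24):

```python
import numpy as np

def assemble_A(H_blocks, Hp_blocks, K, Lap, lam, gamma, alpha):
    """A = (1-a)/(2N) H + a/(2N) H' + (lam/2) Lap (x) K + (lam*gamma/2) I (x) K."""
    N = len(H_blocks)
    Z = np.zeros_like(K)
    H = np.block([[H_blocks[v] if u == v else Z for v in range(N)] for u in range(N)])
    Hp = np.block([[Hp_blocks[v] if u == v else Z for v in range(N)] for u in range(N)])
    return ((1 - alpha) * H + alpha * Hp) / (2 * N) \
        + 0.5 * lam * np.kron(Lap, K) \
        + 0.5 * lam * gamma * np.kron(np.eye(N), K)
```

With PSD inputs the resulting A is symmetric PSD, which is exactly the convexity property exploited in Sec. 5.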
Finally, let us define the following expression for f ∈ G:

PE^α_v(f_v) = E_{q_v(x′)}[f_v(x′)] − (1/2) E_{p^α_v(y)}[f²_v(y)] − 1/2,  ∀v ∈ V.   (31)

Notice that, as a consequence of Theorem 1, we have:

PE^α_v(r^α_v) = PE(p^α_v ∥ q_v) and PE^α_v(r^α_v) ≥ PE^α_v(f_v).   (32)

4. Convergence guarantees

In this section, we discuss the generalization properties of GRULSIF, and more precisely the gains brought by Collaborative LRE when Pearson's χ²-divergence is used in the surrogate cost function. The main result of this section is summarized in Theorem 3. For the rest of the section, we assume n_v = n′_v = n, i.e. we have the same number of observations from p_v and q_v at each node v, and all nodes have the same sample size. Moreover, we assume that observations come in pairs z_v = (x_v, x′_v), as realizations of a probabilistic model described by the joint pdf p_{z,v} with marginal pdfs p_v and q_v. Let us start by defining the functional space:

F_G = { f = (f_1, ..., f_N) ∈ G : (1/2) ‖f‖²_G ≤ Λ² },   (33)

where Λ ≥ 0 is a constant controlling the smoothness of the vector-valued function to be learned w.r.t. the graph G and the Hilbert space H. The first thing to notice is that Problem 17 can alternatively be written in terms of the functional space F_G:

min_{f∈F_G} (1/N) Σ_{v∈V} { (1−α)/(2n) Σ_{i=1}^{n} f²_v(x_{v,i}) + α/(2n) Σ_{i=1}^{n} f²_v(x′_{v,i}) − (1/n) Σ_{i=1}^{n} f_v(x′_{v,i}) }.   (34)

Assumption 2 (Independent observations). {z_{v,i}}_{v∈V, i=1,...,n} = {(x_{v,i}, x′_{v,i})}_{v∈V, i=1,...,n} represent nN pairs of independent observations: for u, v ∈ V and i, j ∈ {1, ..., n}, z_{v,i} = (x_{v,i}, x′_{v,i}) is independent of z_{u,j} = (x_{u,j}, x′_{u,j}) if v ≠ u or i ≠ j. Moreover, for each node v ∈ V, the pairs {(x_{v,i}, x′_{v,i})}_{i=1,...,n} are also identically distributed under the joint law p_{z,v} with marginals p_v and q_v, where x_{v,i} ∼ p_v and x′_{v,i} ∼ q_v.

The data independence assumption appears in previous LRE works that study a single data source (Nguyen et al., 2008, 2010; Sugiyama et al., 2012).
The above graph-based variant is standard in theoretical analyses of multitask learning involving vector-valued kernels (Maurer, 2006a; Yousefi et al., 2018), yet it can be considered a strong hypothesis for many applications.

Assumption 3 (Well-specified model). There exists Λ > 0 such that r^α = (r^α_1, ..., r^α_N) ∈ F_G, where F_G is defined by Eq. 33.

Assumption 3 states that the proposed statistical model is well-specified in the functional space. In particular: i) it implies r_v ∈ H for all v ∈ V, a common hypothesis in the LRE literature (Nguyen et al., 2008, 2010; Sugiyama et al., 2012; Nguyen et al., 2024); ii) it introduces the parameter Λ, which relates to the regularization constant λ (Problem 25) and formalizes the a priori information encoded in the graph that is required to estimate the vector r^α.

Let φ(y) be the feature map associated with the RKHS H, consider g, h ∈ H, and define the operator g ⊗ h : H → H as g ⊗ h(f) = ⟨f, h⟩_H g. Then, we can define the covariance operator associated to the node v ∈ V as:

Σ_v = E_{p^α_v(y)}[φ(y) ⊗ φ(y)].   (35)

Assuming the feature space X is compact and K is continuous, Mercer's theorem implies (Aronszajn, 1950; Dieuleveut, 2017):

Σ_v = Σ_{i=1}^{∞} µ_{v,i} φ̃_{v,i} ⊗ φ̃_{v,i},   (36)

where {φ̃_{v,i}}_{i∈N} forms a Hilbert basis of H, with associated eigenvalues {µ_{v,i}}_{i∈N}. Nevertheless, there exist more general settings where Expr. 36 is satisfied (see Dieuleveut (2017)).

Assumption 4 (Capacity condition). For each v ∈ V, assume Σ_v satisfies Eq. 36. Denote by I the set of indices of non-zero eigenvalues {µ_{v,i}}_{i∈I} of the operator Σ_v, arranged in decreasing order. We assume that µ_{v,i} ≤ s²_v i^{−ζ_v}, i ∈ I, for some ζ_v > 1 and some s_v > 0.

The capacity condition quantifies the size of the RKHS H w.r.t. the eigenbasis {φ̃_{v,i}}_{i∈I}. Larger ζ_v values lead to faster eigenvalue decay, which means fewer basis functions are required to approximate H; hence r^α_v can be approximated within a smaller space.
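The capacity condition can be probed empirically: the spectrum of the normalized Gram matrix K/n is a plug-in estimate of the spectrum of Σ_v, and for a Gaussian kernel it decays very fast, so a polynomial majorization µ_{v,i} ≤ s² i^{−ζ} holds with a large ζ. A small sketch (the sample size, bandwidth, and thresholds are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 1))              # sample from the reference measure
K = np.exp(-((X - X.T) ** 2) / 2.0)      # Gaussian Gram matrix, sigma = 1
mu = np.linalg.eigvalsh(K / n)[::-1]     # empirical eigenvalues, decreasing
# Fast decay: a handful of directions capture essentially all of the spectrum,
# so few basis functions suffice to approximate functions such as r_alpha.
print(mu[:5] / mu[0])
```

Slower decay (ζ close to 1) would indicate a richer effective space, which is exactly the regime where Theorem 3 predicts collaboration to help the most.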
When ζ_v approaches 1, a bigger space is needed to approximate the elements of H, including r^α_v. This assumption has been discussed and analyzed in previous works for obtaining optimal convergence rates in the context of kernel ridge regression (Caponnetto and Vito, 2006; Ying and Pontil, 2007; Steinwart et al., 2009).

Theorem 3 Suppose Assumptions 2-4 are satisfied, and F_G is a class of functions with ranges in [−b, b], b ∈ R_+. Then, for any C ≥ 1 and δ ∈ (0, 1), with probability at least 1 − δ, the solution to Problem 34, f̂ = (f̂_1, ..., f̂_N), satisfies:

(1/N) Σ_{v∈V} [ PE(p^α_v ∥ q_v) − PE^α_v(f̂_v) ] ≤ 2C(202)B₁ρ* + (16B₀²C + 24B₀B₁) log(1/δ)/(nN),
(1/N) Σ_{v∈V} E_{p^α_v(y)}[ [f̂_v − r^α_v]² ] ≤ 4C(202)B₁ρ* + (32B₀²C + 48B₀B₁) log(1/δ)/(nN),   (37)

where:

C_{α,v} = min{ 1/α, ‖r^α_v‖_∞ },  C_α = max_{v∈V} C_{α,v},
B₀ = (1/2)[ (b + C_α)² + 4C_α ],  B₁ = (1/2)(b + C_α)² + (b + C_α),
ζ* = min_{v∈V} ζ_v (recall ζ_v > 1, ∀v ∈ V),  s_max = max_{v∈V} s_v,
ρ* = ( T_G Λ² (b + C_α)² ζ* )^{1/(1+ζ*)} n^{−ζ*/(1+ζ*)} N^{−1/(1+ζ*)} s_max^{1/(1+ζ*)},  T_G = βγ^{−1} + (1 − β)λ_{min+}^{−1},   (38)

where T_G encodes the graph topology, λ_{min+} denotes the smallest nonzero eigenvalue of L, and β = #C(G)/N ∈ [0, 1] is the ratio between the number of connected components of G and the total number of nodes.

The proof is given in Appendix C; it relies mainly on the framework of local Rademacher complexities for multitask learning introduced in Yousefi et al. (2018). Notice that the convergence rates are given in terms of the excess risk and the average L²(P^α) distance between f̂_v and r^α_v w.r.t. the measure p^α_v. The excess risk takes the form of the difference between the expected divergence PE^α_v(f̂_v) and the true χ²-divergence PE(p^α_v ∥ q_v) that the former aims to approximate. The convergence rates depend mainly on the number of observations per node (n^{−ζ*/(1+ζ*)}), the number of nodes in the graph (N^{−1/(1+ζ*)}), the smoothness of the function to be approximated (Λ^{2/(1+ζ*)}), the topology of G (T_G^{1/(1+ζ*)}), and the effective dimension of the space used to approximate each r^α_v, which is encoded by ζ* and s_max.
When ζ* is small and close to 1, the convergence rate can be as slow as O( (s_max T_G Λ²(b + C_α)² / (nN))^{1/2} ), and as fast as O(1/n) when ζ_v → ∞ for all v ∈ V. This means that the gains of collaborative estimation in terms of the excess risk are more relevant when ζ* is smaller, since then the number of nodes, the smoothness, and the graph topology all play a role in the convergence rate. This situation occurs when a larger number of basis functions {φ̃_{v,i}}_{i∈N} is required to approximate the space H, which could mean that r^α_v is harder to estimate w.r.t. the kernel function K and the data distribution. In that case, a larger number of nodes and a smoother r^α over the graph improve performance. However, collaborative estimation offers little advantage when H is low-dimensional (large values of ζ*), since convergence is then governed by the number of observations per node. This suggests that GRULSIF can be used in the regime in which multitask learning is also recommended: when there are many interrelated tasks with little data per task, and each task is too complex to be solved using only its available local data (Yousefi et al., 2018; Zhang and Yang, 2021).

Example of collaborative vs independent LRE. To illustrate the gains collaboration may bring, consider the special case where all the relative likelihood-ratios are the same, with norms upper-bounded by a constant M > 0. Let G be a graph with weight matrix W^c parametrized by a constant c ≥ 0 as W^c_uv = c/N, for all u ≠ v ∈ V, and W^c_uu = 0 for all u ∈ V. Notice that W^{c=0} = 0_{N×N} implies no collaboration at all (N connected components), while W^{c>0} induces a fully connected graph without self-loops (a single connected component). For c > 0, the smallest non-zero eigenvalue of the graph Laplacian is λ_{min+} = c. Importantly, in both cases the norm ‖r^α‖²_G remains the same.
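The claim λ_{min+} = c can be checked directly: for the weight matrix W^c, the Laplacian is a scaled centering matrix,

```latex
L^c = \operatorname{diag}(W^c \mathbf{1}) - W^c
    = \frac{c(N-1)}{N} I_N - \frac{c}{N}\left(J_N - I_N\right)
    = c\left(I_N - \frac{1}{N} J_N\right),
```

where J_N is the all-ones matrix. Hence L^c 1 = 0, and L^c x = c x for every x orthogonal to 1, so the spectrum is {0, c, ..., c} and the smallest non-zero eigenvalue is indeed c.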
In fact, if we choose γ = 1/N, then, since the smoothness term vanishes for identical relative likelihood-ratios,

‖r^α‖²_G = γ Σ_{v∈V} ‖r^α_v‖²_H = (1/N) Σ_{v∈V} ‖r^α_v‖²_H ≤ M.

Thus, to satisfy Assumption 3, it suffices to take Λ² = M in both cases. Let us now compare the two approaches in terms of Theorem 3. In the case of independent LRE (c = 0), we have β = 1 and T_G = γ^{−1} = N, so the leading term of Theorem 3 is upper-bounded as follows:

(ρ*)_{c=0} ≤ 8B₀ ( M(b + C_α)² ζ* )^{1/(1+ζ*)} n^{−ζ*/(1+ζ*)} s_max^{1/(1+ζ*)}.

On the other hand, in the case of full collaboration (c > 0), we have β = 1/N and T_G = 1 + (1 − 1/N) c^{−1} ≤ 2 when c ≥ 1, so the bound becomes respectively:

(ρ*)_{c>0} ≤ 8B₀ ( T_G M(b + C_α)² ζ* )^{1/(1+ζ*)} n^{−ζ*/(1+ζ*)} N^{−1/(1+ζ*)} s_max^{1/(1+ζ*)}
 ≤ 8B₀ ( 2M(b + C_α)² ζ* )^{1/(1+ζ*)} n^{−ζ*/(1+ζ*)} N^{−1/(1+ζ*)} s_max^{1/(1+ζ*)}.  (If c ≥ 1.)

The main difference between these upper bounds is that the term N^{−1/(1+ζ*)} disappears in the independent setting. In other words, the convergence rate then depends only on the number of observations per node n, with no effect from having multiple nodes. On the contrary, the collaborative approach benefits from enforcing the prior information regarding the similarity between relative likelihood-ratios (larger values of c), as well as from the number of nodes. The benefit of collaboration is mediated by the complexity of the problem, encoded in the parameter ζ*.

The case α = 0. The convergence rates given in Theorem 3 still apply when the α-regularization is ignored, provided that Q ≪ P and that for each v ∈ V there exists C_v > 0 such that ‖r_v‖_∞ ≤ C_v < ∞. Although this hypothesis is quite standard in the literature (Nguyen et al., 2010; Sugiyama et al., 2011a; Kanamori et al., 2011; Nguyen et al., 2024), it is restrictive and hard to verify in the general LRE setting in the absence of prior knowledge about P and Q. In practice, if we set α = 0 and this hypothesis is not satisfied, we may face situations where LRE approximates a function that either does not exist or is unbounded, which may hinder numerical performance. The impact on the convergence rates is visible in Theorem 3.
If α = 0, with the convention 1/0 = ∞, then C_α = max_{v∈V} ‖r_v‖_∞, meaning that the generalization bounds scale with the norms {‖r_v‖_∞}_{v∈V}, which are unknown and potentially very large. The α-regularization prevents the likelihood-ratios from exploding and provides some control over the convergence rates via the hyperparameter α. Of course, there may be situations where Q_v ≪ P_v for all v ∈ V and max_{v∈V} ‖r_v‖_∞ ≤ 1/α; in this case, Theorem 3 suggests that the regularization is unnecessary. However, since such a condition cannot be verified in the general problem setting, we prefer to fix the same α for all the nodes to avoid additional hypotheses on {Q_v}_{v∈V} and {P_v}_{v∈V}.

The case Q_v = P_v, ∀v ∈ V. Here, the likelihood-ratios are well-defined: r_v(x) = r^α_v(x) = 1, ∀v ∈ V, for any 0 < α < 1. Furthermore, it is easy to verify that C_{α=0} = C_{α≠0} = 1. Therefore, under the assumptions stated at the beginning of this section and Theorem 3, the convergence behavior is independent of the regularization in this case.

5. Practical implementation

A straightforward optimization of Problem 25 would set the derivative of the objective function to zero, i.e. ∇_Θ Φ(Θ) = 0, and solve to get the estimated parameters Θ̂:

Θ̂ = (1/(2N)) A† h′,   (39)

where A† denotes the pseudoinverse of A ∈ R^{NL×NL}. Nevertheless, the size of the matrix A scales with the number of nodes in the graph (N) and the number of available observations (L). The total complexity of this optimization approach is of order O((LN)³), which makes it prohibitive to compute in most practical situations. For deploying GRULSIF in practice, we propose in this section an optimization procedure that can efficiently handle large graphs and a substantial number of observations. Additionally, we detail the strategy for identifying the regularization constants λ, γ > 0 and the hyperparameters related to the kernel, which we denote by σ; when K(·,·) is the Gaussian kernel, σ is its width parameter.
5.1 Computing the node parameter updates via CBCGD

Instead of computing the pseudoinverse A† to solve Problem 25, we propose to use the Cyclic Block Coordinate Gradient Descent (CBCGD) method (Beck and Tetruashvili, 2013; Li et al., 2018). Our optimization schema operates in cycles, and each cycle involves multiple iterations of a block coordinate gradient descent (GD), one for each node v; therefore, the high-level complexity is O(#Cycles × #Nodes × Cost of GD at one node). Starting from the last term, CBCGD's i-th cycle has to estimate the node parameter θ̂^{(i)}_v at each node v:

θ̂^{(i)}_v = (λγK + η_vI_L)^{-1} [ η_v θ̂^{(i−1)}_v − (1/N)( (1−α)H_v θ̂^{(i−1)}_v + αH′_v θ̂^{(i−1)}_v − h′_v ) − λK( d_v θ̂^{(i−1)}_v − Σ_{u∈ng(v)} W_uv ( 1{u < v} θ̂^{(i)}_u + 1{u > v} θ̂^{(i−1)}_u ) ) ],   (40)

where η_v is the node learning rate; recall that d_v is the node degree and 1{·} is the indicator function. Notice the elegance of the decomposition of the update into two components: one depending on the node v itself (the first parenthesized term), and the other depending on the graph, i.e. only on v's neighbors. It is important to note that the node parameters are updated sequentially within each cycle, in an arbitrary but fixed cyclic order; this is clear in the summation inside the graph-related component. Tweaking this order to adapt it to specific communication restrictions between nodes is possible, but we leave it to the reader to specify the most convenient setting for their needs (see Wright (2015) for a review of the topic). CBCGD is easy to implement and, when applied to quadratic problems, leads to a manageable complexity in terms of the number of cycles required to achieve convergence (Li et al., 2018). This kind of result is made explicit for the optimization schema described in Expr. 40 in the following theorem.

Theorem 4 Suppose that for a dictionary D of size L ≥ 2 we wish to solve the optimization Problem 25 via the CBCGD strategy, where the update w.r.t.
the node parameter θ_v at the i-th cycle is computed as detailed in Eq. 40. Then, if we fix the learning rate of node v equal to the maximum eigenvalue η_v = e_max( ((1−α)H_v + αH′_v)/N + λd_vK ), we will need at most the following number of cycles to achieve a pre-specified accuracy level ε > 0:

#Cycles ≤ ⌈ ( λγc(C_min + λγc) + 16C²_max log²(3NL) ) / ( λγc(C_min + λγc) ) · log( (Φ(Θ^{(0)}) − Φ(Θ*)) / ε ) ⌉,   (41)

where Φ(Θ) is the cost function of Expr. 25, c > 0 is a positive constant, C_min = min_{v∈V} C_v, and:

C_max = e_max( ((1−α)H + αH′)/N + λ(I_N ⊗ K^{1/2})ᵀ[L ⊗ I_L](I_N ⊗ K^{1/2}) ),  C_v = e_max( ((1−α)H_v + αH′_v)/N + λd_vK ).   (42)

The proof of Theorem 4 is provided in Appendix A.3. The computational complexity of the full optimization schema depends on two components: 1) estimating the optimal learning rates η_v and inverting the matrix λγK + η_vI_L, operations to be performed just once per node, which amounts to a computational cost of O(NL³); 2) the cost of Eq. 40 across all nodes and cycles. The cost for a node v at a given cycle i is dominated by matrix-vector multiplications of dimension L, leading to a cycle cost of O(NL²). As indicated by Eq. 41, the required number of CBCGD cycles for achieving a given accuracy level ε scales as O(log²(NL)). The total cost of the second step is then O(NL² log²(NL)), and the total cost of the whole optimization schema is O(NL³ + NL² log²(NL)).

5.2 Nyström dimensionality reduction strategy

The main computational burden of the CBCGD method relates to the dataset size, that is, the number of observations L. This is a common problem for kernel methods and has motivated extensive research. For instance, the random features approach (Rahimi and Recht, 2007) uses a randomized feature map to approximate the input space by a low-dimensional Euclidean space.
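Returning for a moment to Sec. 5.1, the cyclic update of Eq. 40 can be sketched compactly; the helper names are ours, the learning rates follow the block-Hessian rule of Theorem 4, and the preconditioners (λγK + η_vI)^{-1} are computed once per node, as in the complexity analysis above.

```python
import numpy as np

def cbcgd(K, H, Hp, hp, W, lam, gamma, alpha, n_cycles=200):
    """Cyclic block coordinate gradient descent for Problem 25 (sketch of Eq. 40).
    K: (L, L) Gram; H[v], Hp[v], hp[v]: per-node moments; W: (N, N) weights."""
    N, L = len(H), K.shape[0]
    deg = W.sum(axis=1)
    theta = [np.zeros(L) for _ in range(N)]
    # per-node learning rate: top eigenvalue of the node's local Hessian block
    eta = [np.linalg.eigvalsh(((1 - alpha) * H[v] + alpha * Hp[v]) / N
                              + lam * deg[v] * K).max() for v in range(N)]
    # preconditioners, inverted once per node
    P = [np.linalg.inv(lam * gamma * K + eta[v] * np.eye(L)) for v in range(N)]
    for _ in range(n_cycles):
        for v in range(N):  # fixed cyclic order; neighbors u < v already updated
            g_node = (((1 - alpha) * H[v] + alpha * Hp[v]) @ theta[v] - hp[v]) / N
            g_graph = lam * K @ (deg[v] * theta[v]
                                 - sum(W[u, v] * theta[u] for u in range(N)))
            theta[v] = P[v] @ (eta[v] * theta[v] - g_node - g_graph)
    return np.vstack(theta)
```

On a strongly convex instance (λγ > 0 and K positive definite), the fixed point of this update satisfies the stationarity condition of Problem 25, so the iterates approach its unique minimizer.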
Low-rank approximations, such as Nyström approximations (Williams and Seeger, 2000; Smola and Schölkopf, 2000), use a subsample of observations as a dictionary to define a finite-dimensional space that preserves the approximation properties of the original space. In time-series analysis, Richard et al. (2009) proposed to grow the dictionary by adding new elements one-by-one, according to a coherence threshold that keeps the linear dependency of the dictionary elements as low as possible (i.e. basis functions as diverse as possible), while still being able to approximate any of the functions in H. The usual approach followed in non-parametric LRE is simply to create a dictionary out of a subsample of the observations chosen uniformly at random (Sugiyama et al., 2012). In general, the choice of the dictionary learning method depends on the task, the time complexity requirements, and the nature of the chosen kernel.

Nyström approximations replace the feature map φ(x) by its orthogonal projection onto a finite-dimensional space F = span({φ(x) : x ∈ D_L̂}), where D_L̂ = {x_i ∈ X}_{i=1}^{L̂} is a set of carefully chosen points in the original input space (not restricted to data observations), span(·) refers to the set of linear combinations of the input elements, and L̂ ≪ L is a chosen dictionary size. The points φ(x_1), ..., φ(x_L̂) are known as anchor points in H and, via the associated kernel matrix K_L̂ ∈ R^{L̂×L̂}, [K_L̂]_{ij} = K(x_i, x_j), they allow the definition of a new feature map:

ψ(·) = K_L̂^{−1/2} ( K(·, x_1), ..., K(·, x_L̂) )ᵀ.   (43)

The idea is to choose the anchor points so as to preserve the geometry of H, i.e. the dot product in the infinite-dimensional space H gets translated into a dot product in R^{L̂}:

K(x, y) = ⟨φ(x), φ(y)⟩_H ≈ ⟨ψ(x), ψ(y)⟩,  ∀x, y ∈ X.

Following empirical risk minimization and the Representer theorem (Expr. 19), the node-level approximation in this new space takes the form:

f̂_v(x) = Σ_{i=1}^{L} K(x, x_i) θ_{v,i} = ⟨ Σ_{i=1}^{L} φ(x_i) θ_{v,i}, φ(x) ⟩_H ≈ ⟨w_v, ψ(x)⟩,   (44)

where w_v ∈ R^{L̂}.
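Expr. 43 can be implemented with one eigendecomposition of the anchor Gram matrix. A sketch with a Gaussian kernel follows; the helper names and the small jitter term added for numerical stability are our own additions.

```python
import numpy as np

def nystrom_map(anchors, sigma):
    """psi(x) = K_Lhat^{-1/2} (K(x, x_1), ..., K(x, x_Lhat))^T  (Expr. 43)."""
    d2 = ((anchors[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    K_a = np.exp(-d2 / (2.0 * sigma ** 2))
    # inverse matrix square root via eigendecomposition (PSD + tiny jitter)
    w, U = np.linalg.eigh(K_a + 1e-12 * np.eye(len(anchors)))
    K_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
    def psi(X):
        d2x = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2x / (2.0 * sigma ** 2)) @ K_inv_sqrt
    return psi
```

By construction, ⟨ψ(x), ψ(y)⟩ reproduces K(x, y) exactly when x and y are anchor points, and approximately for points well covered by the anchors' span.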
This approximation lets us rephrase Problem 25 in terms of vectors in R^{L̂} and the new feature map ψ(·):

min_{Θ∈R^{NL̂}} (1/N) Σ_{v∈V} { (1−α)/2 θ_vᵀH_{ψ,v}θ_v + α/2 θ_vᵀH′_{ψ,v}θ_v − (h′_{ψ,v})ᵀθ_v } + (λ/2) Σ_{u,v∈V} W_uv ‖θ_v − θ_u‖² + (λγ/2) Σ_{v∈V} ‖θ_v‖²,   (45)

where ‖·‖ refers to the Euclidean norm, and the terms H_{ψ,v}, H′_{ψ,v} ∈ R^{L̂×L̂} and h′_{ψ,v} ∈ R^{L̂} are those of Eq. 24, but now computed using the new feature map ψ(·). Recall that these terms need to be computed only once at the beginning. Moreover, the Nyström approximation does not affect the structure of the problem, which remains quadratic and can be solved via CBCGD, with each iteration taking the form:

θ̂^{(i)}_v = (1/(λγ + η_v)) [ η_v θ̂^{(i−1)}_v − (1/N)( (1−α)H_{ψ,v} θ̂^{(i−1)}_v + αH′_{ψ,v} θ̂^{(i−1)}_v − h′_{ψ,v} ) − λ( d_v θ̂^{(i−1)}_v − Σ_{u∈V} W_uv ( 1{u < v} θ̂^{(i)}_u + 1{u > v} θ̂^{(i−1)}_u ) ) ],   (46)

where, as before, the first parenthesized component depends only on node v and the second only on the graph.

The final computational cost is the sum of the cost of encoding the data points via Expr. 43 and that of the optimization procedure. The encoding requires a matrix inversion, O(L̂³), and L matrix-vector multiplications of dimension L̂ (overall O(LL̂²)), while CBCGD requires the estimation of the optimal learning rates η_v, with total cost O(NL̂³), plus the cost of all node iterates across cycles, which amounts to O(NL̂² log²(NL̂)). In conclusion, the Nyström approximation reduces the computational complexity from O(NL³ + NL² log²(NL)) to O(NL̂³ + LL̂² + NL̂² log²(NL̂)), where L̂ ≪ L.

The Nyström approximation not only offers computational gains, but also brings interesting features from a data-accessibility perspective: notice that computing the node-level quantities H_v, H′_v, h′_v requires access to the full dataset (Expr. 24), while H_{ψ,v}, H′_{ψ,v}, h′_{ψ,v} require only the anchor points and the samples available at that node (X_v, X′_v) (Expr. 43).
In this problem formulation, the update of the parameter vector θ_v of node v only requires the computation of ψ(·) on the node's own local observations, and the parameters of its neighbors {θ_u}_{u∈ng(v)}. In conclusion, the Nyström approximation combined with our optimization schema enables a distributed LRE at each node and, hence, limits access to the foreign data of other nodes to only the indirect access through the parameters θ_u, u ∈ ng(v). This is appealing for applications with data-access restrictions.

The remaining important question is how to select the set of anchor points. There are many strategies to address this problem: for example, Kernel PCA (Schölkopf et al., 1998), random sampling (Williams and Seeger, 2000; Talwalkar et al., 2008), greedy approaches (Bach and Jordan, 2002), or k-means clustering (Zhang et al., 2008). In this work we use the approach proposed by Richard et al. (2009), which is based on the coherence measure. That algorithm builds a dictionary of manageable size and low redundancy, has a low computational cost, and, under mild conditions, produces good approximations of the whole space. The adaptation of this strategy to our context can be found in Appendix A.3.

5.3 POOL: a no-graph variant

One important by-product of the GRULSIF framework and our supporting analysis is that we can derive a reduced LRE variant, which we call POOL, that disregards the graph while enjoying all the other advantages of our non-parametric optimization formulation. By setting W = 0_{N×N}, which neutralizes the graph component so that the associated terms disappear from Eq. 46, we get POOL's optimization problem:

min_{Θ∈R^{NL̂}} (1/N) Σ_{v∈V} { (1−α)/2 θ_vᵀH_{ψ,v}θ_v + α/2 θ_vᵀH′_{ψ,v}θ_v − (h′_{ψ,v})ᵀθ_v } + (λγ/2) Σ_{v∈V} ‖θ_v‖².   (47)

This yields N independent quadratic problems admitting the closed-form solution:

θ̂_v = ( (1−α)H_{ψ,v} + αH′_{ψ,v} + NλγI_{L̂} )^{−1} h′_{ψ,v}.
(48)

POOL has a total computational complexity of O(NL̂³ + LL̂²) (the term related to the cost of the CBCGD schema disappears). POOL can be relevant when it is believed that there is no graph behind the phenomena observed at the different locations, or in situations like those detailed in Sec. 4 where the collaborative estimation may offer little advantage. Moreover, POOL can be seen as a RULSIF variant (Yamada et al., 2011), with two main differences: i) its hyperparameters are selected jointly w.r.t. the mean score (1/N) Σ_{v∈V} ℓ_v(θ_v), while RULSIF selects the hyperparameters of each task independently; ii) POOL uses the Nyström dimensionality reduction technique over the full set of observations, while RULSIF uses simple uniform random sampling at each node (Sugiyama et al., 2012).

5.4 Hyperparameter selection

The performance of GRULSIF depends on the penalization constants γ, λ, and the hyperparameters of the kernel K (e.g. for a Gaussian kernel, that would only be the width σ). As in previous works on non-parametric φ-divergence estimation, we use a cross-validation strategy (Sugiyama et al., 2007, 2011a; Yamada et al., 2011). The main difference is that in GRULSIF we aim to minimize the average of the cost function over all the nodes of the graph. Thus, the score used to identify the optimal hyperparameters is:

L̂(Θ, σ) = (1/N) Σ_{v∈V} L̂_v(θ_v, σ) = (1/N) Σ_{v∈V} { (1−α)/2 θ_vᵀH_{ψ,v}(σ)θ_v + α/2 θ_vᵀH′_{ψ,v}(σ)θ_v − h′_{ψ,v}(σ)ᵀθ_v },   (49)

where the notation H_{ψ,v}(σ), H′_{ψ,v}(σ), h′_{ψ,v}(σ) makes explicit the dependence of these operators on the hyperparameters of the kernel function K. At each iteration of the cross-validation, we define two training sets, X_train and X′_train, to update H_{ψ,v}(σ), H′_{ψ,v}(σ), h′_{ψ,v}(σ). We fix the two hyperparameters λ and γ to estimate the parameter Θ̂(σ, λ, γ), which is the solution to the optimization problem:

Θ̂(σ, γ, λ) = argmin_Θ (1/N) Σ_{v∈V} L̂_v(θ_v, σ) + (λ/2) Σ_{u,v∈V} W_uv ‖θ_v − θ_u‖² + (λγ/2) Σ_{v∈V} ‖θ_v‖².   (50)

The solution of Problem 50 is found via Alg. 2.
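POOL's per-node solve (Eq. 48) is a ridge-type linear system; a minimal sketch follows (the helper name is ours, and the factor N multiplying λγ comes from the 1/N averaging in objective (47)).

```python
import numpy as np

def pool_fit(H, Hp, hp, alpha, lam_gamma, N):
    """Closed-form POOL solution: with W = 0 the objective decouples and
    theta_v = ((1-alpha) H_v + alpha H'_v + N*lam*gamma*I)^{-1} h'_v."""
    Lh = H[0].shape[0]
    return [np.linalg.solve((1 - alpha) * H[v] + alpha * Hp[v]
                            + N * lam_gamma * np.eye(Lh), hp[v])
            for v in range(N)]
```

Each returned θ̂_v is the exact minimizer of its node's strictly convex quadratic, which can be verified by checking that any perturbation increases the decoupled objective.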
Finally, the parameter ˆΘ(σ,γ,λ) is used to identify which combination of parameter values σ ,λ ,γ are optimal such that they minimize the expected value of the chosen score ˆL(ˆΘ(σ,γ,λ),σ) of Eq. 49. The implementation details of the model selection are provided in Alg. 1. We can apply a similar approach to find the hyperparameters of POOL. As POOL ignores the graph structure (W = 0N N), we fix λ = 1, hence the penalization term related to the norm of each functional fv will only depend on γ (Eq. 47). Then, we use crossvalidation to identify the optimal values of the hyperparameters σ and γ. The score to decide for this choice is the same as one described in Eq. 49. Once the anchor points of Nystr om approximation and the hyperparameters have been fixed, we can learn the relative likelihood-ratios. For GRULSIF this amounts to estimating Collaborative likelihood-ratio estimation over graphs Algorithm 1 Model selection for GRULSIF hyperparameters tuning 1: Input: X,X : the two sets of observations to be used for estimating the likelihood-ratios; 2: G = (V,E,W): a given graph; 3: #σ,#λ,#γ: parameter grid to explore for values of σ,λ,γ; 4: R: the number of random splits. 5: Output: σ the optimal scale parameter for the Gaussian kernel, the associated dictionary Dσ , and the two penalization constants λ and γ . 6: Randomly split X and X into R disjoint subsets {Xr}R r=1 and {X r}R r=1 7: for each σ #σ do 8: Compute a dictionary Dσ using the chosen kernel and hyperparameter σ 9: for each (λ,γ) #λ #γ do 10: for each data subset r = 1,...,R do 11: Let X train = X \X r, X test = X r, and Xtrain = X\Xr, Xtest = Xr 12: Compute h train(σ) and H train(σ) using the observations in X train (see Eq. 24) 13: Compute Htrain(σ) using the observations in Xtrain (see Eq. 
24) 14: Find ˆΘ(σ,γ,λ) by solving 15: Compute h test(σ) and H test(σ) using the observations in X test 16: Compute Htest(σ) using the observations in Xtest 17: Compute ˆL(r)( ˆΘ(σ,γ,λ)) = 1 N P v V ˆLv(ˆθv(σ,γ,λ)) using h test(σ), H test(σ), and Htest(σ) 18: end for 19: Compute ˆL(σ,λ,γ) = 1 R PR r=1 ˆL(r)(Θ(σ,γ,λ)) 20: end for 21: end for 22: γ = argminσ,λ,γ ˆℓ(σ,λ,γ) 23: return σ , Dσ , λ , γ the parameter ˆΘ(σ ,λ ,γ ) via Alg. 2, while POOL estimates ˆΘ(σ ,λ = 1,γ ) by solving the N independent quadratic problems of Eq. 48. The hyperparameter α requires a more complex discussion. On one hand, it depends on the LRE application: it is clear that when α = 1, the relative likelihood-ratio rα(x) = q(x) (1 α)p(x)+αq(x) = 1 and the χ2-divergence P α(p q) = 0, independently to p and q. That would make them meaningless as quantities for quantifying the difference between p and q. On the other extremity, if α = 0, we recover the classical likelihood-ratio rα=0(x) = r(x) = q(x) p(x). As we mentioned in Sec. 2.3, in this case, r( ) may be an unbounded function and cause problems of convergence or numerical instability (Yamada et al., 2011). Such phenomena are visible in the rates provided in Theorem 3, where the constants depending on α become undefined. The role of α is to prevent this from happening, since it upper-bounds rα: rα(x) = (1 α)p(x)+αq(x) 1 = 1 (1 α)r 1(x)+α 1 Therefore, at the level of estimating a single likelihood-ratio, the best way to see α is that it smooths the denominator of rα, which is the reference measure to compare p with q. In that sense, values of α > 0 that are far from 1 help in defining meaningful and sensitive estimators. In addition, α has an effect at the graph level: higher values make all estimates getting closer to 1 and consequently become more similar to each other (i.e. 
higher graph smoothness), which impacts the convergence rates of Collaborative LRE (see Theorem 3). To summarize, the optimal α value depends on the interplay between the convergence rates of the estimation method used and the performance achieved in the task we intend to address with the estimated likelihood-ratios. We investigate the first point empirically in dedicated experiments in the next section.

Algorithm 2 GRULSIF: Collaborative and distributed LRE over a graph
1: Input: X, X′: two samples with observations over the nodes of a graph G = (V, E, W);
2:   α ∈ [0,1): parameter of the relative likelihood-ratio (Eq. 2);
3:   σ, D: kernel hyperparameter, and a dictionary containing a precomputed set of ˆL anchor points associated with a kernel K: X × X → R;
4:   λ, γ: constants multiplying the penalization terms;
5:   ˆΘ(0), tol: initialization of the node parameters, and the tolerated relative error before termination.
6: Output: estimated parameters ˆΘ = vec(ˆθ_1, ..., ˆθ_N).
7: for each node v ∈ {1, ..., N} do
8:   Compute H_ψ,v, H′_ψ,v, and h′_ψ,v (see Eq. 24, using ψ instead of φ)
9:   Compute the learning rate η_v = 1 / e_max((1−α)/N · H′_ψ,v + λ d_v I_ˆL) (see Theorem 4)
10: end for
11: i = 0
12: repeat
13:   i = i + 1
14:   for each node v ∈ {1, ..., N} do
15:     Update the node parameter ˆθ_v^(i) (see Eq. 46 and Sec. 5.2)
16:   end for
17: until ‖ˆΘ(i) − ˆΘ(i−1)‖ / ‖ˆΘ(i−1)‖ ≤ tol
18: return ˆΘ(i)

6. Experiments

The empirical evaluation of the GRULSIF framework is conducted with the objective of estimating the likelihood-ratio r^α_v at each node of a given fixed graph. In Sec. 6.1, we present synthetic experiments where the true likelihood-ratios are known by design. Evaluation on real problems is challenging since the true likelihood-ratios are generally unknown; however, in Sec. 6.2 we design a particular setting for seismic data where those true quantities can be safely assumed.
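Before turning to the experiments, the boundedness r_α ≤ 1/α discussed above can be checked numerically. The following is a minimal sketch with hypothetical univariate Gaussians in the roles of p and q (the densities, means, and grid are illustrative choices, not the paper's setup):

```python
import numpy as np

def gauss_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2), evaluated pointwise."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def relative_ratio(x, alpha, mu_p=0.0, mu_q=3.0):
    # r_alpha(x) = q(x) / ((1 - alpha) p(x) + alpha q(x))
    px, qx = gauss_pdf(x, mu_p), gauss_pdf(x, mu_q)
    return qx / ((1.0 - alpha) * px + alpha * qx)

x = np.linspace(-8.0, 8.0, 4001)

# alpha = 0: the plain ratio q/p grows without bound in the right tail
plain = gauss_pdf(x, 3.0) / gauss_pdf(x, 0.0)
print(f"max plain ratio on grid: {plain.max():.2e}")

# alpha > 0: r_alpha stays below 1/alpha everywhere
for alpha in (0.1, 0.5):
    r = relative_ratio(x, alpha)
    print(f"alpha={alpha}: max r_alpha = {r.max():.4f}  (bound 1/alpha = {1/alpha:.1f})")
```

The plain ratio explodes where p has almost no mass, while each regularized ratio saturates near its bound 1/α, illustrating why α > 0 stabilizes the estimation target.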
In all experiments, both GRULSIF and POOL follow the numerical implementation guidelines described in Sec. 5, which include the CBCGD optimization technique and the Nyström approximation of the RKHS.

6.1 Synthetic experiments

Scenarios. Each instance of a synthetic experiment is generated by the three stages below.

1. Graph structure. A random graph is generated according to a standard model:
- A Stochastic Block Model (SBM) with 4 clusters, each containing 25 nodes (intra-cluster edge probability: 0.5; inter-cluster edge probability: 0.01).
- A Grid graph model with 100 nodes arranged in 10 rows and 10 columns.

Table 1: Synthetic scenarios. The scenarios are defined by the graph structure they employ, the node-level distributions (p_v and q_v) generating the data observations at each node, and the location where changes take place in the graph. When a distribution or one of its parameters remains unchanged between p_v and q_v, this is indicated by '–'.

Synth.Ia (X = R¹; SBM, 4 clusters of 25 nodes):
  v ∈ C1: N(µ=0, σ=1) vs. Uniform(−√3, √3)
  v ∈ C2 ∪ C3: N(µ=0, σ=1) vs. –
  v ∈ C4: N(µ=0, σ=1) vs. N(µ=1, σ=–)

Synth.Ib (X = R²; SBM, 4 clusters of 25 nodes):
  v ∈ C1 ∪ C2: N(µ=(0,0)ᵀ, Σ₁,₂=4/5) vs. –
  v ∈ C3: N(µ=(0,0)ᵀ, Σ₁,₂=4/5) vs. N(µ=–, Σ₁,₂=0)
  v ∈ C4: N(µ=(0,0)ᵀ, Σ₁,₂=0) vs. N(µ=(1,1)ᵀ, Σ₁,₂=–)

Synth.IIa (X = R²; Grid, 100 nodes):
  v ∈ V: N(µ=(0,0)ᵀ, Σ=I) vs. N(µ=(r,c)ᵀ, Σ₀,₀=1+|r|, Σ₁,₁=1+|c|, Σ₁,₂=Σ₂,₁=0),
  with r = 2(#row−5)/max_{#row}(#row−5) and c = 2(#col−5)/max_{#col}(#col−5)

Synth.IIb (X = R⁴; Grid, 4 quadrants of 25 nodes):
  v ∈ C2 ∪ C3 ∪ C4: N(µ=0₄, Σ=I₄) vs. N(µ=0₄, Σᵢ,ᵢ=–, Σᵢ,ᵢ₊₁=0.8, Σᵢ₊₁,ᵢ=0.8)

Synth.IIc (X = R²⁰; Grid, 4 quadrants of 25 nodes):
  v ∈ C2 ∪ C3 ∪ C4: N(µ=0₂₀, Σ=I₂₀) vs. N(µ=0₂₀, Σᵢ,ᵢ=–, Σᵢ,ᵢ₊₁=0.8, Σᵢ₊₁,ᵢ=0.8)

2. Nodes' behavior. A scheme is considered that first specifies whether a node v shall experience a change of measure or not (p_v ≠ q_v vs.
p_v = q_v), and then associates specific pdfs to it. This is a critical design feature since, for Collaborative LRE to be meaningful, the nodes' behavior (expressed as likelihood-ratios) should be explainable by the graph. In each scenario, one of the following schemes is used:
- Cluster-based scheme: All nodes in a cluster exhibit the same behavior. It is used for SBM graphs, which have an inherent cluster structure. Clusters are denoted by C1, C2, ... .
- Change in most of the graph: The four quadrants of a Grid graph are considered as separate clusters of nodes. The nodes of the first quadrant (C1) remain unchanged, while for the nodes in the other quadrants the change is in the covariance matrix; specifically, the off-diagonal elements increase from 0 to 0.8.
- Smooth graph variation: All nodes in the graph experience a change in both their mean vectors and their covariance matrices. The magnitude of the change gets larger the more distant a node is from the center of the grid, therefore the distribution q_v is different for each node. This scheme induces smooth node changes across a strong spatial structure, without however corresponding to a multi-cluster structure.

3. Data observations. Finally, for each node v, an equal number of n_v = n′_v = n (i.e. the same for all nodes) data observations are generated from each of the associated p_v and q_v. In each scenario summarized in Tab. 1, the observations have the same dimensionality for all nodes (1, 2, 4, or 20), as our framework requires all nodes to have the same input space X and RKHS.

The scenarios are designed to pose various challenges. In some cases p_v and q_v are different probability models, in others they are the same model with different parametrizations. Moreover, there can be more than one type of change in a scenario; e.g.
in Synth.Ia, all p_v's are the same Normal distribution, the clusters C2 and C3 remain unchanged, while C1 becomes a Uniform with the same first two moments as a standard Normal distribution, and C4 changes its mean. In Synth.IIa, the change of measure is different for each node.

Table 2: LRE competitors. All the methods included in our experimental evaluation.

Method    Reference                Target function   φ-divergence    Graph
KLIEP     Sugiyama et al. (2007)   l.-r.             KL-divergence   No
ULSIF     Sugiyama et al. (2011a)  l.-r.             χ²-divergence   No
RULSIF    Yamada et al. (2011)     relative l.-r.    χ²-divergence   No
POOL      this work (Sec. 5.3)     relative l.-r.    χ²-divergence   No
GRULSIF   this work                relative l.-r.    χ²-divergence   Yes

Synth.IIb and Synth.IIc aim to illustrate the impact of dimensionality on the performance of GRULSIF; similar experiments were originally presented by Rhodes et al. (2020).

Compared LRE methods. We compare against existing Kernel-based LRE methods built upon a φ-divergence variational formulation, namely ULSIF (Sugiyama et al., 2011a), RULSIF (Yamada et al., 2011), and KLIEP (Sugiyama et al., 2007). POOL, RULSIF, and ULSIF rely on the χ²-divergence; the first two use the relative likelihood-ratio (Eq. 2), while ULSIF uses the classical definition (equivalent to setting α = 0). KLIEP uses the KL-divergence. We also include POOL (Sec. 5.3), which relies on the proposed optimization scheme and the same Nyström approximation as GRULSIF, but disregards the graph. Tab. 2 summarizes the compared methods, while their hyperparameter selection is discussed in Appendix B.1.

Evaluation measures. The variational formulation of φ-divergences, described in Sec. 2.3, establishes the connection between estimating the likelihood-ratio between two probability measures and quantifying their dissimilarity via a φ-divergence (see Theorem 1). All LRE methods listed in Tab. 2 leverage this relation.
In the graph-based extension presented in this paper, an LRE method produces node-level likelihood-ratio estimates, which can approximate the φ-divergence between the node-level probability measures p_v and q_v. The connection between the two approximations derives from an approach similar to the one described in Sec. 3. Therefore, when p_v and q_v are known, we can compare the performance of LRE methods along two dimensions: how accurately they approximate the true node-level likelihood-ratio functions, and how effectively they quantify the true φ-divergence between the corresponding probability measures.

The approximation of the true node-level likelihood-ratios is quantified by the average node-level Mean Squared Error (MSE):

    P^α[(f̂ − r^α)²] = (1/N) Σ_{v∈V} E_{p^α_v(y)}[(r^α_v − f̂_v)²(y)]
                    = (1/N) Σ_{v∈V} ( (1−α) E_{p_v(x)}[(r^α_v − f̂_v)²(x)] + α E_{q_v(x′)}[(r^α_v − f̂_v)²(x′)] ),   (51)

where each expected value E_{p^α_v(y)}[(r^α_v − f̂_v)²(y)] is computed by averaging over 10,000 independent samples {(x_i, x′_i)}_{i=1}^{10000} that were not used during the training phase. α equals 0 for the LRE methods whose target function is the usual likelihood-ratio, and is different from zero for the methods that target the relative likelihood-ratio (see Tab. 2).

The MSE of Eq. 51 refers to the whole graph, and it is the quantity in terms of which the convergence guarantees of GRULSIF (Theorem 3) and the theoretical results of Sec. 4 are given. Therefore, measuring the MSE can empirically validate those results. However, for some applications, such as Hypothesis Testing and Change-Point Detection, the interest may be in whether the graph is beneficial when comparing p_v and q_v. In such cases, the node-level φ-divergence estimates are more informative, thus we also compare the node-level approximations with the true φ-divergence at each node. This information is summarized by box-plots and heatmaps in Fig. 2–16. The choice of the φ-divergence depends on the LRE method being used; details are given in Tab.
2 and in the self-contained figure captions.

Results and findings. The first batch of results for the designed scenarios is in Fig. 2–6. Two line-plots at the top of each figure show the convergence of the methods, in terms of the logarithm of the MSE, as a function of the sample size n, for a graph with N = 100 nodes. Recall that the sample size refers to the equal number of available observations from each of the pdfs of a node v, i.e. n_v = n′_v = n from p_v and q_v, respectively. The line-plot on the right zooms into a smaller range of the y-axis to compare GRULSIF, POOL, and RULSIF, so as to investigate the impact of using the graph. The displayed error band corresponds to one standard deviation computed over 10 instances. Below the line-plots there is a grid of box-plots (sample size × method) for n = {50, 100, 250, 500}. Each box-plot shows the φ-divergence estimates within meaningful node groups for each scenario, i.e. the node clusters C1 to C4, or the subset C(u) and its complement. In these synthetic experiments, the true φ-divergence is identical for all nodes belonging to the same cluster, and the dashed lines indicate this value. In Fig. 4, where q_v varies according to the position of v in a Grid graph, the box-plots are replaced by heatmaps. The first row shows the true node-level φ-divergence, and the subsequent rows show the approximations for varying sample sizes.

In most of the experiments, POOL and GRULSIF show performance superior to the other methods. Even though POOL disregards the graph, as ULSIF, RULSIF, and KLIEP also do, it still shows better convergence behavior than those methods. Evidently, the introduction of a global non-redundant dictionary, the Nyström approximation, and the joint hyperparameter selection that we propose boost the LRE performance when multiple sources of information are available and are all approximated by the same RKHS.
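As an aside on the evaluation protocol, the graph-level MSE of Eq. 51 can be approximated by Monte Carlo whenever the true ratios are known. The sketch below uses hypothetical one-dimensional Gaussian nodes and the trivial constant estimate f̂_v ≡ 1 (all choices are illustrative, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def r_alpha(x, mu_p, mu_q, alpha):
    # true relative likelihood-ratio q / ((1 - alpha) p + alpha q)
    px, qx = gauss_pdf(x, mu_p), gauss_pdf(x, mu_q)
    return qx / ((1.0 - alpha) * px + alpha * qx)

def node_mse(f_hat, mu_p, mu_q, alpha, m=10_000):
    # E_{p^alpha}[(r_alpha - f_hat)^2] = (1-alpha) E_p[.] + alpha E_q[.]  (cf. Eq. 51)
    xp = rng.normal(mu_p, 1.0, m)   # samples from p_v
    xq = rng.normal(mu_q, 1.0, m)   # samples from q_v
    ep = np.mean((r_alpha(xp, mu_p, mu_q, alpha) - f_hat(xp)) ** 2)
    eq = np.mean((r_alpha(xq, mu_p, mu_q, alpha) - f_hat(xq)) ** 2)
    return (1.0 - alpha) * ep + alpha * eq

alpha = 0.1
nodes = [(0.0, 0.0), (0.0, 1.0)]       # two hypothetical nodes: unchanged / mean shift
trivial = lambda x: np.ones_like(x)    # constant estimate f_hat = 1
mse = np.mean([node_mse(trivial, mp, mq, alpha) for mp, mq in nodes])
print(f"graph-level MSE (Eq. 51) of the constant estimate: {mse:.4f}")
```

On the unchanged node the error is zero (r_α ≡ 1), so the graph-level average isolates the error incurred at the node with a change of measure.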
Concerning the effect of using the graph structure, when the smoothness hypothesis is satisfied and the likelihood-ratios are smooth over the graph, GRULSIF achieves consistently better convergence and estimation quality compared to POOL, indicating the importance of the geometry of the problem. The advantage of GRULSIF becomes clearer as the sample size (n) at each node gets smaller, which is expected as the node-level tasks become more challenging. We see that GRULSIF's estimates have lower bias compared to the other methods, especially for nodes where p_v = q_v. This bias gets smaller as the sample size increases. However, as expected, collaboration may bring smaller gains when the available data at each node are sufficient to solve each estimation task independently, e.g. in Synth.Ia and Synth.Ib, where POOL and GRULSIF have similar performance for n = n′ = 500 (Fig. 3). For a thorough discussion, see Sec. 4.

The impact of the dimensionality (d) of the input space on Kernel-based LRE methods can be investigated by comparing Fig. 5 (Synth.IIb, d = 4) and Fig. 6 (Synth.IIc, d = 20). The box-plots show that the bias of all methods increases as d grows, with GRULSIF and POOL showing a smaller bias. The convergence of GRULSIF and POOL with the sample size is slower in higher dimensions due to requiring larger dictionaries. This is expected, as the size of the dictionary built by the strategy we employ (Richard et al., 2009) depends on the covering number of the RKHS; for Gaussian kernels, the covering number scales with d. This may explain why RULSIF, ULSIF, and KLIEP, which fix the size of the dictionary independently of the kernel, show a higher bias. In conclusion, when the dimensionality of the input space reduces the smoothness of the likelihood-ratios w.r.t. the RKHS, Kernel-based LRE methods converge slower, and adaptive dimensionality reduction strategies are necessary.
This highlights the need for further research in the high-dimensional regime.

The role of α-regularization. To investigate the sensitivity of GRULSIF and POOL to the regularization parameter α, we complement the previous experiments by reporting in Fig. 7–11 results for α = {0.01, 0.1, 0.5}. We can see that tuning α affects convergence, as suggested by Theorem 3. Low α values make the LRE task harder, hence lead to estimates with higher bias and variance, and slower convergence to the true target quantities (this is most evident in the box-plots). Moreover, graph regularization leads to more robust node-level estimates, i.e. lower variance within sets of connected nodes and faster convergence, especially for nodes where p_v = q_v. Finally, when α is closer to 1, GRULSIF and POOL get closer, since the target relative likelihood-ratios become easier to estimate even without collaboration. The findings show that Collaborative LRE is more robust as α gets closer to 0, i.e. when targeting a less regularized likelihood-ratio (see also Sec. 5.4).

The role of the graph size (N). The number of nodes affects GRULSIF's performance, both numerically and in terms of its generalization properties. For the first point, notice that under the hypothesis that the size of the dictionary (ˆL) is fixed and the sample size at each node remains the same (n_v = n′_v = n), GRULSIF's complexity scales linearly with N, specifically at a rate of O(N ˆL³ + n N ˆL² + N ˆL² log₂(n N ˆL)) (see Sec. 5.2). Regarding the impact of N on the generalization properties of GRULSIF, notice that the bound of Theorem 3 is dominated by the term N^{−1/(1+ζ)} (T_G Λ²)^{1/(1+ζ)}, in which each factor depends on N. Expr. 16 tells us that Λ increases as more likelihood-ratios need to be estimated, while N^{−1/(1+ζ)} decreases. On the other hand, the new nodes associated with the added LRE tasks get wired into the graph and modify its topology, hence affect T_G.
As a result, either the term (T_G Λ²)^{1/(1+ζ)} dominates or increases at the same rate as N^{1/(1+ζ)}, which implies that increasing N would not help GRULSIF to converge faster; or the term N^{−1/(1+ζ)} dominates, in which case increasing N is beneficial. These distinct convergence regimes are observed in the example provided in Sec. 4, where the effect of N differs between a completely disconnected graph and a fully connected graph, despite the LRE tasks being the same for all nodes. As Λ is an a priori unknown parameter, our current results do not allow the automatic identification of the convergence regime as the number of nodes increases.

To demonstrate the effect of the number of nodes, we compare GRULSIF and POOL in synthetic scenarios with varying graph sizes (N = {100, 500, 1000}). The change of measure at each node is detailed in Tab. 1: in Synth.Ia and Synth.Ib we increase the number of nodes per cluster, while in Synth.IIa, Synth.IIb, and Synth.IIc we extend the Grid graph proportionally. The results are reported in Fig. 12–16. As expected, the size of the graph does not affect all scenarios in the same way. In Synth.IIc (Fig. 16), we observe that increasing N leads to faster GRULSIF convergence. On the contrary, in Synth.IIb (Fig. 15), including more nodes increases the bias of GRULSIF. POOL appears unaffected by the variation of N across all scenarios, as it fits a model at each node independently and hence its convergence rate depends only on the sample size (n).

Figure 2: Experiment Synth.Ia. Comparison of GRULSIF and POOL, with fixed α = 0.1 and varying sample size n = {50, 100, 250, 500} (i.e. n from p_v and n′ from q_v, with n = n′), against existing LRE approaches.
All methods are built upon the χ²-divergence between p^α_v and q_v, except for KLIEP, which uses the KL-divergence. GRULSIF, POOL, and RULSIF target the relative likelihood-ratio with α = 0.1, while ULSIF and KLIEP target the original likelihood-ratio (α = 0). The graph size is fixed at N = 100. Line-plots: Convergence to the respective target true likelihood-ratios, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n. Box-plots: The distribution of node-level φ-divergence estimates obtained by each method for varying sample size n = n′. The horizontal dashed lines (red and green) indicate the target true φ-divergence within each cluster.

Figure 3: Experiment Synth.Ib. Comparison of GRULSIF and POOL, with fixed α = 0.1 and varying sample size n = {50, 100, 250, 500} (i.e. n from p_v and n′ from q_v, with n = n′), against existing LRE approaches. All methods are built upon the χ²-divergence between p^α_v and q_v, except for KLIEP, which uses the KL-divergence. GRULSIF, POOL, and RULSIF target the relative likelihood-ratio with α = 0.1, while ULSIF and KLIEP target the original likelihood-ratio (α = 0). The graph size is fixed at N = 100. Line-plots: Convergence to the respective target true likelihood-ratios, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n. Box-plots: The distribution of node-level φ-divergence estimates obtained by each method for varying sample size n = n′. The horizontal dashed lines (red and green) indicate the target true φ-divergence within each cluster.
Figure 4: Experiment Synth.IIa. Comparison of GRULSIF and POOL, with fixed α = 0.1 and varying sample size n = {50, 100, 250, 500} (i.e. n from p_v and n′ from q_v, with n = n′), against existing LRE approaches. All methods are built upon the χ²-divergence between p^α_v and q_v, except for KLIEP, which uses the KL-divergence. GRULSIF, POOL, and RULSIF target the relative likelihood-ratio with α = 0.1, while ULSIF and KLIEP target the original likelihood-ratio (α = 0). The graph size is fixed at N = 10 × 10. Line-plots: Convergence to the respective target true likelihood-ratios, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n. Heatmaps: In this experiment, q_v depends on the position of the node in the Grid graph, and so does the true φ-divergence value. In the first row, a heatmap shows the true node-level φ-divergence associated with each method. The heatmaps in the following rows show the node-level φ-divergence estimates obtained by each method.

Figure 5: Experiment Synth.IIb. Comparison of GRULSIF and POOL, with fixed α = 0.1 and varying sample size n = {50, 100, 250, 500} (i.e. n from p_v and n′ from q_v, with n = n′), against existing LRE approaches. All methods are built upon the χ²-divergence between p^α_v and q_v, except for KLIEP, which uses the KL-divergence. GRULSIF, POOL, and RULSIF target the relative likelihood-ratio with α = 0.1, while ULSIF and KLIEP target the original likelihood-ratio (α = 0). The graph size is fixed at N = 100.
Line-plots: Convergence to the respective target true likelihood-ratios, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n. Box-plots: The distribution of node-level φ-divergence estimates obtained by each method for varying sample size n = n′. The horizontal dashed lines (red and green) indicate the target true φ-divergence within each node group of interest.

Figure 6: Experiment Synth.IIc. Comparison of GRULSIF and POOL, with fixed α = 0.1 and varying sample size n = {50, 100, 250, 500} (i.e. n from p_v and n′ from q_v, with n = n′), against existing LRE approaches. All methods are built upon the χ²-divergence between p^α_v and q_v, except for KLIEP, which uses the KL-divergence. GRULSIF, POOL, and RULSIF target the relative likelihood-ratio with α = 0.1, while ULSIF and KLIEP target the original likelihood-ratio (α = 0). The graph size is fixed at N = 100. Line-plots: Convergence to the respective target true likelihood-ratios, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n. Box-plots: The distribution of node-level φ-divergence estimates obtained by each method for varying sample size n = n′. The horizontal dashed lines (red and green) indicate the target true φ-divergence within each node group of interest.

Figure 7: Experiment Synth.Ia for varying α-regularization.
Complement of Fig. 2 that focuses on the behavior of GRULSIF and POOL for varying α = {0.01, 0.1, 0.5}. The graph size is fixed at N = 100. Line-plots: Convergence to the true relative likelihood-ratio, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n (with n = n′) and of α. Box-plots: The distribution of node-level estimates, {P̂E^α_v(X_v, X′_v)}_{v∈V} (see Eq. 30), obtained for varying α and sample size n = n′. The horizontal dashed lines (red and green) indicate the true PE(p^α_v ∥ q_v) (see Eq. 32) within each cluster.

Figure 8: Experiment Synth.Ib for varying α-regularization. Complement of Fig. 3 that focuses on the behavior of GRULSIF and POOL for varying α = {0.01, 0.1, 0.5}. The graph size is fixed at N = 100. Line-plots: Convergence to the true relative likelihood-ratio, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n (with n = n′) and of α. Box-plots: The distribution of node-level estimates, {P̂E^α_v(X_v, X′_v)}_{v∈V} (see Eq. 30), obtained for varying α and sample size n = n′. The horizontal dashed lines (red and green) indicate the true PE(p^α_v ∥ q_v) (see Eq. 32) within each cluster.

Figure 9: Experiment Synth.IIa for varying α-regularization. Complement of Fig.
4 that focuses on the behavior of GRULSIF and POOL for varying α = {0.01, 0.1, 0.5}. The graph size is fixed at N = 100. Line-plots: Convergence to the true relative likelihood-ratio, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n (with n = n′) and of α. Heatmaps: In this experiment, q_v depends on the position of the node in the Grid graph, and so does PE(p^α_v ∥ q_v) (see Eq. 32). In the first row, a heatmap shows the true node-level value for each method. The heatmaps in the following rows show the node-level estimates, {P̂E^α_v(X_v, X′_v)}_{v∈V} (see Eq. 30), obtained for varying α and sample size n = n′.

Figure 10: Experiment Synth.IIb for varying α-regularization. Complement of Fig. 5 that focuses on the behavior of GRULSIF and POOL for varying α = {0.01, 0.1, 0.5}. The graph size is fixed at N = 100. Line-plots: Convergence to the true relative likelihood-ratio, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n (with n = n′) and of α. Box-plots: The distribution of node-level estimates, {P̂E^α_v(X_v, X′_v)}_{v∈V} (see Eq. 30), obtained for varying α and sample size n = n′. The horizontal dashed lines (red and green) indicate the true PE(p^α_v ∥ q_v) (see Eq. 32) within each node group of interest.
Figure 11: Experiment Synth.IIc for varying α-regularization. Complement of Fig. 6 that focuses on the behavior of GRULSIF and POOL for varying α = {0.01, 0.1, 0.5}. The graph size is fixed at N = 100. Line-plots: Convergence to the true relative likelihood-ratio, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n (with n = n′) and of α. Box-plots: The distribution of node-level estimates, {P̂E^α_v(X_v, X′_v)}_{v∈V} (see Eq. 30), obtained for varying α and sample size n = n′. The horizontal dashed lines (red and green) indicate the true PE(p^α_v ∥ q_v) (see Eq. 32) within each node group of interest.

Figure 12: Experiment Synth.Ia for varying graph size. Complement of Fig. 2 that focuses on the behavior of GRULSIF and POOL for varying N = {100, 500, 1000}. The regularization parameter is fixed at α = 0.1. Line-plots: Convergence to the true relative likelihood-ratio, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n (with n = n′). Box-plots: The distribution of node-level estimates, {P̂E^α_v(X_v, X′_v)}_{v∈V} (see Eq. 30), obtained for varying N and sample size n = n′. The horizontal dashed lines (red and green) indicate the true value of PE(p^α_v ∥ q_v) (see Eq. 32) within each cluster.
Figure 13: Experiment Synth.Ib for varying graph size. Complement of Fig. 3 that focuses on the behavior of GRULSIF and POOL for varying N = {100, 500, 1000}. The regularization parameter is fixed at α = 0.1. Line-plots: Convergence to the true relative likelihood-ratio, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n (with n = n′). Box-plots: The distribution of node-level estimates, {P̂E^α_v(X_v, X′_v)}_{v∈V} (see Eq. 30), obtained for varying N and sample size n = n′. The horizontal dashed lines (red and green) indicate the true value of PE(p^α_v ∥ q_v) (see Eq. 32) within each cluster.

Figure 14: Experiment Synth.IIa for varying graph size. Complement of Fig. 4 that focuses on the behavior of GRULSIF and POOL for varying N = {100, 500, 1000} (i.e. extending the 10 × 10 Grid graph proportionally). The regularization parameter is fixed at α = 0.1. Line-plots: Convergence to the true relative likelihood-ratio, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n (with n = n′). Heatmaps: In this experiment, q_v depends on the position of the node in the Grid graph, and so does PE(p^α_v ∥ q_v) (see Eq. 32).
In the first row, a heatmap shows the true node-level value for each method. The heatmaps in the following rows show the node-level estimates, {P̂E^α_v(X_v, X′_v)}_{v∈V} (see Eq. 30), obtained for varying N and sample size n = n′.

Figure 15: Experiment Synth.IIb for varying graph size. Complement of Fig. 5 that focuses on the behavior of GRULSIF and POOL for varying N = {100, 500, 1000}. The regularization parameter is fixed at α = 0.1. Line-plots: Convergence to the true relative likelihood-ratio, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n (with n = n′). Box-plots: The distribution of node-level estimates, {P̂E^α_v(X_v, X′_v)}_{v∈V} (see Eq. 30), obtained for varying N and sample size n = n′. The horizontal dashed lines (red and green) indicate the true value of PE(p^α_v ∥ q_v) (see Eq. 32) within each cluster.

Figure 16: Experiment Synth.IIc for varying graph size. Complement of Fig. 6 that focuses on the behavior of GRULSIF and POOL for varying N = {100, 500, 1000}. The regularization parameter is fixed at α = 0.1. Line-plots: Convergence to the true relative likelihood-ratio, in terms of log P^α[(f̂ − r^α)²] (see Eq. 51), as a function of the sample size n (with n = n′). Box-plots: The distribution of node-level estimates, {P̂E^α_v(X_v, X′_v)}_{v∈V} (see Eq.
30), obtained for varying N and sample size n = n'. The horizontal dashed lines (red and green) indicate the true value of PE(p^α_v || q_v) (see Eq. 32) within each cluster.

Table 3: GRULSIF's empirical time complexity. Average running time in seconds in synthetic experiments with n = 500 observations at each node. The average is taken over 10 runs of GRULSIF, after the model selection, whose computational time is not included. (On a single machine with a 12th Gen Intel(R) Core(TM) i7-12700H processor and 16GB of RAM.)

#nodes (N) | Synth.Ia | Synth.Ib | Synth.IIa | Synth.IIb | Synth.IIc
100        | 0.09s    | 0.24s    | 0.31s     | 0.88s     | 29.19s
500        | 1.00s    | 1.25s    | 2.15s     | 1.70s     | 317.19s
1000       | 2.75s    | 6.05s    | 9.33s     | 5.54s     | 1395.95s

Finally, we compare the empirical time complexity of GRULSIF for different graph sizes. Tab. 3 reports the average running time over 10 GRULSIF instances, after the model selection step. The time complexity increases with N. For the experiments Synth.Ia-Synth.IIb, this increase does not seem to compromise the scalability of the algorithm. Nevertheless, the overhead for Synth.IIc is considerably larger. The higher input space dimensionality of Synth.IIc results in a larger dictionary of anchor points for approximating the associated RKHS. Specifically, the running time scales at rate O(L̂^3) in the dictionary size L̂, which is evident even for graphs with as few as 100 nodes. In conclusion, the scalability of GRULSIF to large graphs depends more on the size of the dictionary used for approximation than on the number of nodes.

6.2 Real-life experiments

In this section, we compare GRULSIF to other kernel-based LRE methods on seismic data. Although this work focuses on LRE performance, GRULSIF can be used for designing Change-Point Detection and Two-Sample Testing methods that account for graph-structured information. Such applications have been developed in parallel to this work, in the separate contributions of de la Concha et al. (2023, 2025).
Seismic data fall into the setting described in Sec. 2. Modern geological hazard monitoring systems consist of several stations located across a territory. Each station gathers data on ground noise and shaking at its specific location. Stations located close to each other are expected to exhibit similar observations, which is a fundamental hypothesis in Spatial Statistics. We follow here a simple approach that considers the stations as nodes of a graph that encodes the similarity induced by their geographical proximity. The main difficulty in evaluating the approximation quality of LRE methods on real data is that, in most cases, the likelihood-ratio is unknown. In this application, though, there is a specific case where the likelihood-ratio is known in advance: when there is no seismic activity, i.e. p_v = q_v for all v ∈ V, implying r^α_v = 1 for all v ∈ V, regardless of α. We leverage this observation to illustrate how incorporating geographical proximity within LRE improves the approximation accuracy.

Seismic data preprocessing. We start by identifying three seismic events that occurred in New Zealand. Seism A is of magnitude 5.5R (in Richter scale) and occurred on May 31, 2021 [1]; Seism B is of magnitude 2.6R and occurred on Oct 2, 2023 [2]; Seism C is of magnitude 2.3R.

1. Public data made available by the GeoNet project, GNS Science (1970): https://www.geonet.org.nz/earthquake/2021p405872
2. https://www.geonet.org.nz/earthquake/2023p741652

Figure 17: New Zealand seismographic network. The purple nodes represent the locations of seismic stations. The edges illustrate a 3-nearest neighbors graph, inferred from the stations' geographical coordinates to capture spatial proximity.

Table 4: Results on the seismic data set. We report P^α[(f - r^α)^2] (see Expr. 51) for each LRE method.
To estimate this quantity, we preprocess the data corresponding to a time period where r^α_v = 1 for all v ∈ V, fit each LRE method using 4/5 of the observations as training set, and keep the remaining 1/5 as test set. By design, a reported value closer to 0 indicates a better approximation.

LRE Method        | Seism A | Seism B | Seism C
KLIEP             | 0.31    | 0.36    | 0.37
ULSIF             | 0.10    | 0.17    | 0.14
RULSIF (α = 0.1)  | 0.10    | 0.16    | 0.13
POOL (α = 0.1)    | 0.14    | 0.20    | 0.08
GRULSIF (α = 0.1) | 0.08    | 0.09    | 0.02

Seism C occurred on Oct 30, 2024 [3]. The data is made publicly available by the GeoNet project, which operates a geological hazard monitoring system in that territory. The system is composed of seismic stations equipped with strong-motion accelerometers that provide 3D signals corresponding to the shaking along three perpendicular directions. To emulate the situation where p_v and q_v are approximately the same, we analyze the waveforms recorded during a one-minute window, between 30 and 29 minutes before the seismic event, sampled at 100 Hz. The purpose of this choice is to ensure that the observations within this time-window are stationary. The waveforms correspond to the measurements along the three perpendicular directions of the accelerometers, defining the input space X ⊆ R^3. The preprocessing is performed independently for each station, and independently for each direction, using the ObsPy Python package (Beyreuther et al., 2010). We apply a series of standard preprocessing steps used in seismology: the instrument response is deconvolved, the linear trend is removed, the observations are demeaned, a 2-20 Hz bandpass filter is applied, and the filtered data are downsampled by a factor of 5. To reduce the temporal dependency, we fit an autoregressive model of order 1 and keep the residuals for further analysis. The output is then standardized to zero mean and unit variance. After completing these steps, we obtain 1200 observations at each location.
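The per-channel preprocessing chain described above can be sketched as follows. This is a minimal NumPy/SciPy sketch of the generic steps (detrend/demean, bandpass, downsample, AR(1) residuals, standardization); the instrument-response deconvolution is omitted since it requires station metadata, and the function name and parameters are illustrative, not the authors' code.

```python
import numpy as np
from scipy.signal import butter, filtfilt, decimate

def preprocess_channel(x, fs=100.0, band=(2.0, 20.0), down=5):
    """Sketch of the per-channel preprocessing steps described above."""
    # Remove the linear trend (which also removes the mean).
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)
    x = x - (slope * t + intercept)
    # 2-20 Hz bandpass filter (4th-order Butterworth, zero-phase).
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    x = filtfilt(b, a, x)
    # Downsample by a factor of 5 (decimate applies an anti-aliasing filter).
    x = decimate(x, down)
    # Fit an AR(1) model by least squares and keep the residuals.
    phi = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])
    resid = x[1:] - phi * x[:-1]
    # Standardize to zero mean and unit variance.
    return (resid - resid.mean()) / resid.std()

rng = np.random.default_rng(0)
out = preprocess_channel(rng.standard_normal(6000))  # 60 s at 100 Hz
```

A 60-second trace at 100 Hz (6000 samples) is reduced to roughly 1200 observations after downsampling, matching the sample sizes reported above.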
We then assign the first 600 observations to p_v and the remaining 600 to q_v. To account for spatial similarity, we generate an unweighted spatial graph G_S = (V, E, W), where the nodes represent the seismic stations and the edges are computed so as to form a spatial 3-nearest neighbors graph, as visualized in Fig. 17.

3. https://www.geonet.org.nz/earthquake/2024p817566

Results and findings. Tab. 4 compares the kernel-based LRE methods listed in Tab. 2 in terms of their average node-level MSE (Eq. 51). To compute this measure, we need both the approximated relative likelihood-ratio f and the true r^α. The former is given by each LRE estimator trained on 80% of the observations, while by design we have r^α(x) = (1, ..., 1) ∈ R^N for any α and any x ∈ X. This is consistent with the sampling from p_v and q_v, which is done such that p_v = q_v. The expected value E_{p^α_v(y)}[(f_v - r^α_v)^2(y)] is then estimated by averaging over the remaining 20% of the observations, which were not used during the training phase. In this configuration, we expect E_{p^α_v(y)}[(f_v - r^α_v)^2(y)] to be low and close to zero. In the results of Tab. 4, GRULSIF achieves the best performance (lowest MSE) for all the seismic events, which highlights the importance of the graph structure in enhancing the likelihood-ratio approximation.

7. Conclusions

In this paper, we first introduced a graph-based extension of the likelihood-ratio estimation (LRE) problem that we call Collaborative LRE. Then, we presented GRULSIF: a novel collaborative non-parametric LRE framework for multiple data sources whose similarity is encoded in a given fixed graph.
We provided a detailed convergence analysis that highlights the conditions under which collaboration is beneficial compared to individual local LRE at each node, and more specifically the role played by important variables of the problem, such as the complexity of the problem at hand, the amount of data available for each local problem, the expressiveness of the graph structure chosen for estimation, and the number of nodes. The provided distributed GRULSIF implementation scales well to big graphs. This is supported by the computational complexity analysis, as well as by the reported empirical running times. In fact, the results show that the scaling of the algorithm is affected more by the size of the dictionary used for approximation than by the number of nodes. As future work, there can be applications enabled by GRULSIF, some of which were outlined in this article. The presented theoretical guarantees could be used to study the behavior of task-oriented algorithms built on top of GRULSIF estimates. Finally, an interesting question that deserves thorough investigation in the future is how to choose a graph that would render Collaborative LRE meaningful and more efficient.

Acknowledgments

The authors acknowledge support from the Industrial Data Analytics and Machine Learning Chair hosted at ENS Paris-Saclay, Université Paris-Saclay, and the Île-de-France Region.

Appendix A. Methodological aspects

A.1 Connection between Pearson's divergence and likelihood-ratio estimation

This section explains the relationship between the χ^2-divergence and likelihood-ratio estimation (LRE), which justifies why we formulate the problem of comparing probabilistic models defined over the nodes of a graph as an LRE problem. In other words, we prove Theorem 2.

Proof of Theorem 2. The χ^2-divergence refers to the φ-divergence with φ(ζ) = (ζ - 1)^2 / 2, which is strictly convex around 1 and essentially smooth.
Its convex conjugate is given by φ*(s) = s^2/2 + s. Furthermore, from Inequality 2, we can conclude that PE(P^α || Q) is bounded. As Q ∈ P(X), the total variation measure satisfies |Q| = Q, and Q ∈ M_H. To verify the last point, take f ∈ H; then:

∫ f(x') dQ(x') = ∫ <f, K(x', ·)>_H dQ(x') ≤ C ||f||_H < ∞,

where the equality follows from the reproducing property of H, and the inequality is a consequence of the Cauchy-Schwarz inequality. The upper-bound on r^α implies that the function φ'(r^α(x)) = r^α(x) - 1 belongs to span(H_B); thus the requirements of Theorem 1 are satisfied and we obtain the expression:

PE(P^α || Q) = sup_{g ∈ span(H_B)} ∫ g(x') dQ(x') - ∫ [g^2(y)/2 + g(y)] dP^α(y),   (52)

which admits the unique optimal solution g* = r^α - 1. Let us rewrite Eq. 52 in terms of functions of the form g = f - 1:

PE(P^α || Q) = sup_{f ∈ span(H_B)} ∫ (f(x') - 1) dQ(x') - ∫ [(f(y) - 1)^2/2 + (f(y) - 1)] dP^α(y)
            = sup_{f ∈ span(H_B)} ∫ f(x') dQ(x') - ∫ [f^2(y)/2] dP^α(y) - 1/2.

A.2 Creating an efficient dictionary

In this section, we describe in detail how the implementation of GRULSIF creates the dictionary. Our strategy is an adaptation of the greedy algorithm introduced by Richard et al. (2009), built around the dictionary's coherence, which measures the redundancy of a set of basis functions with respect to their linear dependency and is computable in time linear in the dataset size. In practice, this approach selects a subset of observations (datapoints) forming a non-redundant set of basis functions with good approximation performance. The strategy is applicable to unit-norm kernels (i.e. K(x,x) = 1, for all x ∈ X).
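The greedy coherence-based selection just described can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: it uses a Gaussian kernel (which is unit-norm, since K(x,x) = 1) and illustrative function names; a point is kept only if it stays µ0-incoherent with every element already selected.

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    # Unit-norm Gaussian kernel: K(x, x) = 1 for every x.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def build_dictionary(observations, mu0=0.3, sigma=1.0):
    """Greedy pass over the data: keep x only if its coherence with the
    current dictionary stays below the threshold mu0."""
    D = []
    for x in observations:
        if all(abs(gauss_kernel(x, d, sigma)) <= mu0 for d in D):
            D.append(x)
    return D

rng = np.random.default_rng(1)
X = rng.uniform(-5, 5, size=(200, 2))
D = build_dictionary(X, mu0=0.3, sigma=1.0)
# By construction, every pair of dictionary elements has coherence <= mu0.
coh = max(gauss_kernel(a, b) for i, a in enumerate(D) for b in D[:i])
```

The resulting dictionary is much smaller than the dataset while keeping pairwise kernel values (and hence the redundancy of the induced basis functions) below the threshold.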
Formally, the coherence of a dictionary D_L = {x_l}_{l=1}^L of size L is defined as:

μ = max_{l ≠ l'} |<φ(x_l), φ(x_l')>| = max_{l ≠ l'} |K(x_l, x_l')|.   (54)

This quantity can be read as the largest level of collinearity between pairs of elements of the dictionary. When the basis functions are orthogonal, it equals zero. The greedy recipe processes the observations one after the other and adds a new observation x to the current dictionary D if its addition keeps the dictionary coherence below a given threshold μ_0 ∈ (0,1), that is, if max_{x_l ∈ D} |K(x, x_l)| ≤ μ_0. The strategy is detailed in Alg. 3. In all the reported experiments, we fix μ_0 = 0.3. Larger values of μ_0 increased the running time of GRULSIF without offering performance gains.

Algorithm 3 Dictionary creation
1: Input: {X_v}_{v∈V} = {{x_{v,1}, ..., x_{v,n_v}}}_{v∈V}, {X'_v}_{v∈V} = {{x'_{v,1}, ..., x'_{v,n'_v}}}_{v∈V}: the observations indexed by the node of the graph G = (V, E, W) they belong to;
2: μ_0 ∈ (0,1): a coherence threshold;
3: σ: kernel hyperparameter (we assume a unit-norm kernel);
4: Output: D: a dictionary containing selected elements from {X_v}_{v∈V} and {X'_v}_{v∈V}.
5: D = {}
6: for v ∈ V do
7:   # Select non-redundant elements from X_v
8:   for i ∈ {1, ..., n_v} do
9:     if max_{x ∈ D} |K_{σ_v}(x_{v,i}, x)| ≤ μ_0 then
10:      D = D ∪ {x_{v,i}}
11:    end if
12:  end for
13:  # Select non-redundant elements from X'_v
14:  for i ∈ {1, ..., n'_v} do
15:    if max_{x ∈ D} |K_{σ_v}(x'_{v,i}, x)| ≤ μ_0 then
16:      D = D ∪ {x'_{v,i}}
17:    end if
18:  end for
19: end for
20: return D

A.3 Analysis of the proposed optimization algorithm

In this section, we detail the Cyclic Block Coordinate Gradient Descent (CBCGD) strategy described in Sec. 5. In particular, we prove an upper-bound on the number of iterations needed to attain a given precision ϵ, in terms of the size of the dictionary L and the number of nodes N. Theorem 4 is a particular case of the results appearing in Li et al. (2018), where the convergence of Cyclic Block Coordinate-type algorithms is analyzed.
For completeness of the presentation, we present some of their main results. The objective functions analyzed in Li et al. (2018) take the form:

min_{Θ ∈ R^M} Φ(Θ) = min_{Θ ∈ R^M} Z(Θ) + R(Θ),   (55)

where Z is a twice-differentiable loss function, R is a possibly non-smooth and strongly convex penalty function, and the variable Θ is of dimension M = Σ_{v=1}^N M_v and is partitioned into disjoint blocks Θ = (θ_1, θ_2, ..., θ_N), each of dimension M_v. It is supposed that the penalization term can be written as R(Θ) = Σ_{v=1}^N R_v(θ_v).

Assumption 5 Z(·) is convex, and its gradient mapping ∇Z(·) is Lipschitz-continuous and also block-wise Lipschitz-continuous, i.e. there exist positive constants C and C_v such that for any Θ, Θ' ∈ R^M and v = 1, ..., N, we have:

||∇Z(Θ') - ∇Z(Θ)|| ≤ C ||Θ' - Θ||  and  ||∇_v Z(Θ) - ∇_v Z(Θ')|| ≤ C_v ||θ_v - θ'_v||,   (56)

where, in the second inequality, Θ and Θ' differ only in their v-th block.

Assumption 6 R(·) is strongly convex and block-wise strongly convex, i.e. there exist positive constants μ and μ_v such that for any Θ, Θ' ∈ R^M and v ∈ V, we have, for any ξ ∈ ∂R(Θ'):

R(Θ) ≥ R(Θ') + (Θ - Θ')^T ξ + (μ/2) ||Θ - Θ'||^2  and  R_v(θ_v) ≥ R_v(θ'_v) + (θ_v - θ'_v)^T ξ_v + (μ_v/2) ||θ_v - θ'_v||^2.   (57)

Under the above assumptions, the update of CBCGD at cycle i for block v is defined as:

θ̂_v^{(i)} = argmin_{θ_v} (θ_v - θ̂_v^{(i-1)})^T ∇_v Z(θ̂_{u<v}^{(i)}, θ̂_{u≥v}^{(i-1)}) + (η_v/2) ||θ_v - θ̂_v^{(i-1)}||^2 + R_v(θ_v).   (58)

Then, to attain a precision ϵ > 0 of the objective value, we need at most

i_max = [ (μ C^μ_min + 16 C^2 log^2(3 N M_max)) / (μ C^μ_min) ] log( (Φ(Θ^{(0)}) - Φ(Θ*)) / ϵ )

iterations to ensure Φ(Θ^{(i)}) - Φ(Θ*) < ϵ for i ≥ i_max, where C^μ_min = min_{v∈V} (C_v + μ_v) and M_max = max_{v∈V} M_v.

Proof of Theorem 4. In this section, we assume that K is positive-definite, i.e. its minimum eigenvalue is strictly positive: e_min(K) > 0. This is not true in general, but we can transform the problem for this condition to hold, e.g. using the approximation K̃ = K + c·I_L, with c > 0. Alternatively, K can be ensured to be positive-definite by selecting a dictionary with linearly independent components (Richard et al., 2009), or via Nyström approximations along with anchor points selected via Kernel-PCA, as described in Sec. 5.2.
Problem 25 takes the form of Expr. 55, where we identify the functions Z and R as:

Z(Θ) = (1/N) [ (1/2) Θ^T H̄ Θ - h̄^T Θ ] + (λ/2) Θ^T (I_N ⊗ K^{1/2})^T [L ⊗ I_L] (I_N ⊗ K^{1/2}) Θ
     = (1/N) Σ_{v∈V} [ ((1-α)/2) θ_v^T H_v θ_v + (α/2) θ_v^T H'_v θ_v - h'^T_v θ_v ] + (λ/2) Σ_{u,v∈V} W_uv (θ_v - θ_u)^T K (θ_v - θ_u),
R(Θ) = Σ_{v∈V} R_v(θ_v) = λγ Σ_{v∈V} θ_v^T K θ_v.

That given, it is easy to verify that the updating scheme of Eq. 58 takes the form of Eq. 40. It is clear that, given our hypotheses, R(Θ) and the R_v(θ_v) are strongly convex with modulus λγ e_min(K); therefore, Assumption 6 is satisfied. Second, the full gradient of Z(·) can be written as:

∇Z(Θ) = [ (1/N) H̄ + λ (I_N ⊗ K^{1/2})^T [L ⊗ I_L] (I_N ⊗ K^{1/2}) ] Θ - (1/N) h̄,   (59)

which is Lipschitz-continuous with constant ||(1/N) H̄ + λ (I_N ⊗ K^{1/2})^T [L ⊗ I_L] (I_N ⊗ K^{1/2})||. From the node-level expression, it is easy to derive the partial gradient of Z(·):

∇_v Z(Θ) = (1/N) [ (1-α) H_v + α H'_v ] θ_v + λ K ( d_v θ_v - Σ_{u ∈ ng(v)} W_uv θ_u ) - (1/N) h'_v,   (60)

where d_v is the degree of node v. This means:

||∇_v Z(Θ) - ∇_v Z(Θ')|| ≤ || (1/N) [ (1-α) H_v + α H'_v ] + λ d_v K || · ||θ_v - θ'_v|| = C_v ||θ_v - θ'_v||,   (61)

where C_v = e_max( (1/N) [ (1-α) H_v + α H'_v ] + λ d_v K ). Then, Assumption 5 is satisfied. With these elements, and by fixing η_v = C_v, we can apply Theorem 5, where:

C = || (1/N) H̄ + λ (I_N ⊗ K^{1/2})^T [L ⊗ I_L] (I_N ⊗ K^{1/2}) ||,  C_v = || (1/N) [ (1-α) H_v + α H'_v ] + λ d_v K ||,  C^μ_min = min_{v∈V} C_v + μ.

After substitution, we get the expression given in Eq. 41.

Appendix B. GRULSIF in practice and details for the empirical evaluation

B.1 Further details for the conducted experiments

This section details the hyperparameter selection used in the experiments. For the RULSIF and ULSIF algorithms, we follow Sugiyama et al. (2011a) and Yamada et al. (2011). We run a leave-one-out cross-validation procedure over the parameter σ associated with the Gaussian kernel and the penalization term γ. The parameter σ is selected from the grid {0.6σ_median, 0.8σ_median, 1.0σ_median, 1.2σ_median, 1.4σ_median}, where σ_median is the value of σ found via the median heuristic over the observations in X'_v. The penalization parameter γ is selected from the grid {1e-5, 1e-3, 0.1, 10}.
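The median heuristic referred to above can be sketched as follows: the Gaussian-kernel width σ is set to the median of the pairwise Euclidean distances in the sample, and the cross-validation grid is built around it. This is a minimal NumPy sketch with illustrative names, not the authors' code.

```python
import numpy as np

def median_heuristic(X):
    """Median of pairwise Euclidean distances of the rows of X,
    a common default for the width sigma of a Gaussian kernel."""
    diff = X[:, None, :] - X[None, :, :]       # (n, n, d) pairwise differences
    dists = np.sqrt((diff ** 2).sum(-1))       # (n, n) distance matrix
    iu = np.triu_indices(len(X), k=1)          # strict upper triangle: each pair once
    return np.median(dists[iu])

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
sigma_median = median_heuristic(X)
# Cross-validation grid around the heuristic, as described above.
grid = [c * sigma_median for c in (0.6, 0.8, 1.0, 1.2, 1.4)]
```

Each candidate width is then scored by (leave-one-out or k-fold) cross-validation, as described in the text.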
The procedure for KLIEP is similar, but we use instead a 5-fold cross-validation procedure, over the grid {0.6σ_median, 0.8σ_median, 1.0σ_median, 1.2σ_median, 1.4σ_median} for the width σ of the Gaussian kernel, and over the grid {1e-5, 1e-3, 0.1, 10} for the penalization constant. For GRULSIF and POOL, we apply 5-fold cross-validation for selecting the hyperparameters σ, γ, λ (Alg. 1). Since POOL ignores the graph, we fix λ = 1, and the penalization term related to the norm of each functional f_v depends only on the parameter γ. In order to select the width σ of the Gaussian kernel, we first compute {σ_v}_{v∈V} for each node via the median heuristic applied to the observations of X_v (those are available when creating the dictionary), and we define σ_min = min{σ_v}_{v∈V}, σ_median = median{σ_v}_{v∈V}, and σ_max = max{σ_v}_{v∈V}. We then choose the final parameter from the set {σ_min, (σ_min + σ_median)/2, σ_median, (σ_max + σ_median)/2, σ_max}. γ is selected from the set {1e-5, 1e-3, 0.1, 10}/c, using c = sqrt(min(n, n')) for POOL. For GRULSIF, we select the optimal λ from the set {1e-5, 5.62e-4, 3.16e-2, 1.77, 1e2}/c, and the optimal γ̂ (representing the product of the constants γ·λ) from the set {1e-5, 1e-3, 0.1, 10}/c, where c = sqrt(min(n, n')·N).

Appendix C. GRULSIF convergence guarantees

C.1 Auxiliary concepts and results from Multitask Learning

The excess risk bounds for Multitask Learning in Yousefi et al. (2018) depend on the concept of Multitask Local Rademacher Complexity (MTLRC), which aims at quantifying the complexity of classes of vector-valued functions. MTLRC leads to sharper bounds compared to the classical Global Rademacher Complexity, while remaining easy to compute for a VV-RKHS, which is of interest to us. Specifically, the bounds are tight enough to make explicit the role of important variables of the problem, such as the number of observations, the number of tasks, and the smoothness of the vector-valued function to be approximated, in terms of the norm in the associated VV-RKHS.
For completeness of presentation, and in order to better clarify our results, we adapt to the notation used in the main text and rewrite important concepts and results appearing in the reference papers Bartlett et al. (2005); Yousefi et al. (2018). Let us denote by Z = (z_{v,i})_{v∈V, i=1,...,n} a set of n·N independent observations such that, for each v ∈ {1, ..., N}, the observations {z_{v,i}}_{i=1}^n are identically distributed according to the measure p_{z,v}. Given a vector-valued function h = (h_1, ..., h_N), we define the expressions:

P_z[h] = (1/N) Σ_{v∈V} E_{p_{z,v}}[h_v(z)]  and  P_{z,n}[h] = (1/(nN)) Σ_{v∈V} Σ_{i=1}^n h_v(z_{v,i}).   (62)

We start by introducing the concept of MTLRC (Yousefi et al., 2018) and the sub-root function (Bartlett et al., 2005) that will appear in the upper-bounds of the excess risk.

Definition 6 (Multitask Local Rademacher Complexity) For a vector-valued function class F = {f = (f_1, ..., f_N)}, the Multitask Local Rademacher Complexity (MTLRC) for ρ > 0, R(F, ρ), is defined as:

R(F, ρ) = E_{z,σ} [ sup_{V(f) ≤ ρ, f = (f_1,...,f_N) ∈ F} (1/(nN)) Σ_{v∈V} Σ_{i=1}^n σ_{v,i} f_v(z_{v,i}) ],

where {σ_{v,i}}_{v=1,...,N; i=1,...,n} is a sequence of independent Rademacher variables. We denote by E_{z,σ}[·] the expectation w.r.t. all the involved random variables, and V(f) is an upper-bound on the variance of the functions in F.

Definition 7 (Sub-root function) A function ϱ : [0, ∞) → [0, ∞) is sub-root iff it is non-decreasing and the function ϱ(ρ)/√ρ is non-increasing for ρ > 0.

Lemma 8 (Lemma 3.2 in Bartlett et al. (2005)) If ϱ is a nontrivial sub-root function, then it is continuous on [0, ∞) and the equation ϱ(ρ) = ρ has a unique non-zero solution ρ*, which is known as the fixed point of ϱ. Moreover, for any ρ > 0, it holds that ρ ≥ ϱ(ρ) iff ρ* ≤ ρ.

The following result is established when the MTLRC is itself a sub-root function.

Lemma 9 (Lemma 3.4 in Bartlett et al.
(2005)) If the class F is star-shaped around f_0, and V : F → R_+ is a function that satisfies V(af) ≤ a^2 V(f) for any f ∈ F and any a ∈ [0,1], then the function ϱ defined for ρ ≥ 0 by:

ϱ(ρ) = E_σ [ sup_{V(f - f_0) ≤ ρ, f ∈ F} (1/(nN)) Σ_{v∈V} Σ_{i=1}^n σ_{v,i} f_v(z_{v,i}) ]

is a sub-root function, and ρ → E_z[ϱ(ρ)] is also a sub-root function.

A function class being star-shaped around f_0 means that {f_0 + a(f - f_0) : f ∈ F, a ∈ [0,1]} ⊆ F. Note that a convex function class is by definition star-shaped around each of its elements. MTLRC will be used to obtain a global error bound for classes of vector-valued functions whose variance is bounded, V(f - f_0) ≤ ρ. The goal is to identify models of small generalization error and small variance. There is a tradeoff between the size of the subset we consider (controlled by the parameter ρ) and its complexity; the optimal choice is given by the fixed point of a sub-root function. To formalize the relationship between the generalization error and the variance, we need to define the concept of Vector-Valued Bernstein Class.

Definition 10 (Vector-Valued Bernstein Class) Let 0 < β ≤ 1 and B > 0. A vector-valued function class F is said to be a (β, B)-Bernstein class w.r.t. the probability measure P if there exists a function V : F → R_+ such that:

P f^2 ≤ V(f) ≤ B (P f)^β,  for all f ∈ F.   (65)

The following result describes the role of MTLRC in obtaining upper-bounds for Multitask Learning, and it is the core component in the proof of Theorem 3.

Theorem 11 (Theorem B.3 in Yousefi et al. (2018)) Let F = {f = (f_1, ..., f_N)} be a class of vector-valued functions satisfying max_{v∈V} sup_{z∈Z} |f_v(z)| ≤ b. Let Z = {Z_v}_{v∈V} = {z_{v,1}, ..., z_{v,n}}_{v∈V} be a vector of n·N random variables where, for each v ∈ V, {z_{v,1}, ..., z_{v,n}} are identically distributed. Assume that F is a (β, B)-Bernstein class of vector-valued functions with 0 < β ≤ 1 and B ≥ 1. Let ϱ be a sub-root function with fixed point ρ*.
If B·R(F, ρ) ≤ ϱ(ρ) for all ρ ≥ ρ*, then for any C > 1 and δ ∈ (0,1), with probability at least 1 - δ, every f ∈ F satisfies:

P_z[f] ≤ (C/(C - β)) P_{z,n}[f] + (2C)^{β/(2-β)} 20^{2/(2-β)} max{ (ρ*)^{1/(2-β)}, (ρ*)^{(1-β)/(2-β)} } + ( 24 B b / ((2 - β) n N) ) log(1/δ).

C.2 Lemmata before Theorem 3

The general idea is to use Theorem 11 to upper-bound the excess risk associated with Problem 33. To this end, we need to address the following subproblems:
1. Define a class of vector-valued functions satisfying the hypotheses of Theorem 11 (this is the content of Lemma 14).
2. Identify the sub-root function ϱ that upper-bounds the MTLRC of the class (provided in the last point of Lemma 14).
3. Upper-bound the fixed point of ϱ (see Lemma 17).

Let us start by defining the instantaneous loss function for a scalar function f ∈ H:

ℓ_v(f)(z_v) = [ (1 - α) f^2(x_v) + α f^2(x'_v) ] / 2 - f(x'_v).   (67)

Here, the variable z_v denotes a pair of observations z_v = (x_v, x'_v), where x_v ~ p_v and x'_v ~ q_v. Given a vector-valued function f = (f_1, f_2, ..., f_N) ∈ G and a set of pairs of observations (z_1, z_2, ..., z_N), we define the vector-valued loss function:

ℓ(f) = (ℓ_1(f_1)(z_1), ℓ_2(f_2)(z_2), ..., ℓ_N(f_N)(z_N)).   (68)

To facilitate reading, we introduce the following operators evaluated at vector-valued functions of the form h = (h_1, ..., h_N):

P[h] = (1/N) Σ_{v∈V} E_{p_v(x)}[h_v(x)],  Q[h] = (1/N) Σ_{v∈V} E_{q_v(x')}[h_v(x')],  P^α[h] = (1/N) Σ_{v∈V} E_{p^α_v(y)}[h_v(y)].   (69)

We can easily verify the following expressions:

P^α[h] = (1 - α) P[h] + α Q[h]  and  P^α[r^α h] = Q[h],   (70)

where, with an abuse of notation, r^α h refers to the point-wise multiplication of the vector-valued functions r^α and h. With this notation, we can define the cost function:

L(f) = (1/N) Σ_{v∈V} E_{p_{z,v}}[ℓ_v(f_v)(z)] = P_z[ℓ(f)].   (71)

The following lemma identifies the connection between the excess risk L(f) - L(r^α) and the distance P^α[(f - r^α)^2]. In particular, it makes evident the advantages of using the χ^2-divergence as a surrogate loss function for LRE.

Lemma 12 Let the vector-valued functional space F_G (Expr.
33) and suppose the value of each scalar function f_v ranges in [-b, b], b ∈ R_+. Then the following statements hold:

1. There is a function f* = (f*_1, ..., f*_N) ∈ G satisfying:

f* = argmin_{f ∈ F_G} (1/N) Σ_{v∈V} [ ((1-α)/2) E_{p_v(x)}[f_v^2(x)] + (α/2) E_{q_v(x')}[f_v^2(x')] - E_{q_v(x')}[f_v(x')] ] = argmin_{f ∈ F_G} L(f).

2. For every f ∈ G, we have P^α[(f - f*)^2] = 2 (L(f) - L(f*)).

3. There exists B_0 > 0 such that, for all f ∈ G: P_z[(ℓ(f) - ℓ(f*))^2] ≤ (B_0/2) P^α[(f - f*)^2] = B_0 P_z[ℓ(f) - ℓ(f*)].

Proof. First point: Assumption 3 states that r^α ∈ F_G. Following the line of reasoning used to prove Expr. 13, we can conclude:

argmin_{f ∈ F_G} (1/N) Σ_{v∈V} [ (1/2) E_{p^α_v(y)}[f_v^2(y)] - E_{q_v(x')}[f_v(x')] ] = argmin_{f ∈ F_G} (1/N) Σ_{v∈V} (1/2) E_{p^α_v(y)}[(f_v(y) - r^α_v(y))^2],

which implies that r^α = (r^α_1, ..., r^α_N) is a solution to the optimization problem.

Second point: The proof of the previous point implies f* = r^α. Then the second point of the lemma can be restated in terms of L(f) - L(r^α), for f ∈ F_G:

L(f) - L(r^α) = (1/N) Σ_{v∈V} { (1/2) E_{p^α_v(y)}[f_v^2(y) - (r^α_v)^2(y)] - E_{q_v(x')}[f_v(x') - r^α_v(x')] }
             = (1/N) Σ_{v∈V} E_{p^α_v(y)} [ (1/2)(f_v^2(y) - (r^α_v)^2(y)) - r^α_v(y)(f_v(y) - r^α_v(y)) ]
             = (1/N) Σ_{v∈V} (1/2) E_{p^α_v(y)}[(f_v - r^α_v)^2(y)] = (1/2) P^α[(f - r^α)^2],

where the second equality comes from the identity E_{p^α_v(y)}[r^α_v(y) g(y)] = E_{q_v(x')}[g(x')].

Third point: Let us define the positive constants C_{α,v} = min{1/α, ||r^α_v||_∞} and C_α = max_{v∈V} C_{α,v}. Notice that, by the hypothesis on the functional space F_G and the upper-bound on r^α_v with respect to the regularization parameter, we have:

||f_v + r^α_v||_∞ ≤ (b + C_{α,v}) ≤ (b + C_α).   (72)

Then:

P_z[(ℓ(f) - ℓ(r^α))^2] = P_z[ ( ((1-α)/2)[f^2 - (r^α)^2](x) + (α/2)[f^2 - (r^α)^2](x') - [f - r^α](x') )^2 ]
  ≤ 2 P_z[ ( ((1-α)/2)[f^2 - (r^α)^2](x) + (α/2)[f^2 - (r^α)^2](x') )^2 ] + 2 Q[(f - r^α)^2]   (inequality (a+b)^2 ≤ 2a^2 + 2b^2)
  ≤ 2 P^α[ (f^2 - (r^α)^2)^2 ] + 2 P^α[ r^α (f - r^α)^2 ]   (convexity of x → x^2 and Expr. 70)
  = 2 P^α[ (f - r^α)^2 (f + r^α)^2 ] + (2/α) P^α[(f - r^α)^2]
  ≤ [ (b + C_α)^2 + 4C_α ] P^α[(f - r^α)^2]   (Expr. 72)
  = 2 B_0 P^α[(f - r^α)^2].

Moreover, the second point implies: P_z[(ℓ(f) - ℓ(r^α))^2] ≤ B_0 [L(f) - L(r^α)] = B_0 P_z[ℓ(f) - ℓ(r^α)].
Lemma 13 Let Y = {Y_v}_{v∈V} = {y_{v,1}, ..., y_{v,n}}_{v∈V} be a sample of n·N observations such that, for all v, i: y_{v,i} i.i.d. ~ p^α_v. Then:

E_σ [ sup_{P^α[(f-r^α)^2] ≤ ρ, f ∈ F_G} Σ_{v∈V} Σ_{i=1}^n σ_{v,i} f_v(y_{v,i}) r^α_v(y_{v,i}) ] ≤ C_α E_σ [ sup_{P^α[(f-r^α)^2] ≤ ρ, f ∈ F_G} Σ_{v∈V} Σ_{i=1}^n σ_{v,i} f_v(y_{v,i}) ].

Proof. Let us denote by E_{σ\σ_{u,j}}[·] the expectation with respect to all the Rademacher random variables {σ_{v,i}}_{v=1,...,N; i=1,...,n} except σ_{u,j}, and let

U_{V\u;j}(f) = Σ_{v∈V} Σ_{i=1}^n σ_{v,i} f_v(y_{v,i}) r^α_v(y_{v,i}) - σ_{u,j} f_u(y_{u,j}) r^α_u(y_{u,j}).

In what follows, all suprema are taken over {f ∈ F_G : P^α[(f - r^α)^2] ≤ ρ}. Then:

E_σ [ sup_f Σ_{v,i} σ_{v,i} f_v(y_{v,i}) r^α_v(y_{v,i}) ] = E_{σ\σ_{u,j}} E_{σ_{u,j}} [ sup_f U_{V\u;j}(f) + σ_{u,j} f_u(y_{u,j}) r^α_u(y_{u,j}) ].

By the definition of the supremum, for any ϵ > 0 there exist g, h ∈ F_G with P^α[(g - r^α)^2] ≤ ρ and P^α[(h - r^α)^2] ≤ ρ, such that:

U_{V\u;j}(g) + g_u(y_{u,j}) r^α_u(y_{u,j}) ≥ (1 - ϵ) sup_f [ U_{V\u;j}(f) + f_u(y_{u,j}) r^α_u(y_{u,j}) ],
U_{V\u;j}(h) - h_u(y_{u,j}) r^α_u(y_{u,j}) ≥ (1 - ϵ) sup_f [ U_{V\u;j}(f) - f_u(y_{u,j}) r^α_u(y_{u,j}) ].

The latter implies:

E_{σ_{u,j}} [ sup_f U_{V\u;j}(f) + σ_{u,j} f_u(y_{u,j}) r^α_u(y_{u,j}) ]
  = (1/2) sup_f [ U_{V\u;j}(f) + f_u(y_{u,j}) r^α_u(y_{u,j}) ] + (1/2) sup_f [ U_{V\u;j}(f) - f_u(y_{u,j}) r^α_u(y_{u,j}) ]
  ≤ (1/(2(1-ϵ))) [ U_{V\u;j}(g) + g_u(y_{u,j}) r^α_u(y_{u,j}) + U_{V\u;j}(h) - h_u(y_{u,j}) r^α_u(y_{u,j}) ]
  = (1/(2(1-ϵ))) [ U_{V\u;j}(g) + U_{V\u;j}(h) + s r^α_u(y_{u,j}) (g_u(y_{u,j}) - h_u(y_{u,j})) ],

where s = sgn(g_u(y_{u,j}) - h_u(y_{u,j})). Then, the upper-bound on r^α_u implies:

(1/2) [ U_{V\u;j}(g) + U_{V\u;j}(h) + C_α s (g_u(y_{u,j}) - h_u(y_{u,j})) ]
  = (1/2) [ U_{V\u;j}(g) + C_α s g_u(y_{u,j}) ] + (1/2) [ U_{V\u;j}(h) - C_α s h_u(y_{u,j}) ]
  ≤ (1/2) sup_f [ U_{V\u;j}(f) + C_α s f_u(y_{u,j}) ] + (1/2) sup_f [ U_{V\u;j}(f) - C_α s f_u(y_{u,j}) ]
  = E_{σ_{u,j}} [ sup_f U_{V\u;j}(f) + C_α σ_{u,j} f_u(y_{u,j}) ],

where in the last equality we have used the symmetry of σ_{u,j}.
As the inequality is satisfied for all ϵ > 0, we have:

E_{σ_{u,j}} [ sup_f U_{V\u;j}(f) + σ_{u,j} f_u(y_{u,j}) r^α_u(y_{u,j}) ] ≤ E_{σ_{u,j}} [ sup_f U_{V\u;j}(f) + C_α σ_{u,j} f_u(y_{u,j}) ].

Using the same argument for the remaining σ_{v,i} terms, v ≠ u, i ≠ j, we get the final result.

Lemma 14 identifies the vector-valued function class satisfying the hypotheses of Theorem 11.

Lemma 14 Let us define the class of functions:

H_G = { h_f = (h_{f_1}, ..., h_{f_N}), h_{f_v} : (x_v, x'_v) → ℓ_v(f_v)(x_v, x'_v) - ℓ_v(r^α_v)(x_v, x'_v), f ∈ F_G },   (75)

which satisfies the following points:

1. max_{v∈V} sup_{(x,x')} |h_{f_v}(x, x')| ≤ B_1.
2. H_G is a (β, B)-Bernstein class with β = 1 and B = B_0 = (1/2) [ (b + C_α)^2 + 4C_α ].
3. H_G satisfies the inequality R(H_G, ρ) ≤ 2 (b + C_α) R(F_G, ρ/(2B_0)), where the two involved MTLRCs are defined by:

R(H_G, ρ) = E_{z,σ} [ sup_{V(h_f) ≤ ρ, f ∈ F_G} (1/(nN)) Σ_{v,i} σ_{v,i} h_{f_v}(x_{v,i}, x'_{v,i}) ],
R(F_G, ρ) = E_{p^α,σ} [ sup_{P^α[f^2] ≤ ρ, f ∈ F_G} (1/(nN)) Σ_{v,i} σ_{v,i} f_v(y_{v,i}) ].

Proof. First point:

max_{v∈V} sup_{(x,x')} |h_{f_v}(x, x')| = max_{v∈V} sup_{(x,x')} |ℓ_v(f_v)(x, x') - ℓ_v(r^α_v)(x, x')|
  ≤ max_{v∈V} sup_{(x,x')} [ |f_v^2 - (r^α_v)^2|(x) + α |f_v^2 - (r^α_v)^2|(x') + |f_v - r^α_v|(x') ]
  ≤ 2(b + C_α)^2 + (b + C_α) =: B_1.   (Expr. 72)

Second point: Due to the properties listed in Lemma 12, we have the following inequalities:

P_z[h_f^2] = P_z[(ℓ(f) - ℓ(r^α))^2] ≤ (B_0/2) P^α[(f - r^α)^2] = B_0 P_z[ℓ(f) - ℓ(r^α)],   (78)

which means H_G is a (β, B)-Bernstein class of vector-valued functions, with β = 1 and B = B_0, and the function controlling the variance of the class is defined as V(h_f) = (B_0/2) P^α[(f - r^α)^2].
Third point: By fixing ρ ∈ R_+, we can verify (all suprema below are over {f ∈ F_G : V(h_f) ≤ ρ}):

B_0 R(H_G, ρ) = B_0 E_{z,σ} [ sup_f (1/(nN)) Σ_{v,i} σ_{v,i} (ℓ_v(f_v) - ℓ_v(r^α_v))(x_{v,i}, x'_{v,i}) ]
  ≤ B_0 E_{z,σ} [ sup_f (1/(nN)) Σ_{v,i} σ_{v,i} ( ((1-α)/2) f_v^2(x_{v,i}) + (α/2) f_v^2(x'_{v,i}) ) ] + B_0 E_{z,σ} [ sup_f (1/(nN)) Σ_{v,i} σ_{v,i} f_v(x'_{v,i}) ]
  (subadditivity of the supremum and symmetry of the Rademacher variables)
  ≤ B_0 E_{p^α,σ} [ sup_f (1/(nN)) Σ_{v,i} (1/2) σ_{v,i} f_v^2(y_{v,i}) ] + B_0 E_{p^α,σ} [ sup_f (1/(nN)) Σ_{v,i} σ_{v,i} f_v(y_{v,i}) r^α_v(y_{v,i}) ],

where the last expression is a consequence of E_{p^α_v(y)}[h(y)] = (1-α) E_{p_v(x)}[h(x)] + α E_{q_v(x')}[h(x')] and E_{p^α_v(y)}[g(y) r^α_v(y)] = E_{q_v(x')}[g(x')]. Notice that x → x^2 is a Lipschitz function with Lipschitz constant 2b on [-b, b]. We can therefore apply the contraction property of Rademacher complexity, which also holds for the Local Rademacher Complexity of vector-valued function classes (Theorem 17 in Maurer (2006a)). The latter result leads to the inequality:

E_{p^α,σ} [ sup_f (1/(nN)) Σ_{v,i} σ_{v,i} f_v^2(y_{v,i}) ] ≤ 2b E_{p^α,σ} [ sup_f (1/(nN)) Σ_{v,i} σ_{v,i} f_v(y_{v,i}) ].

By combining the above inequality and Lemma 13, we obtain:

B_0 R(H_G, ρ) ≤ B_0 (b + C_α) E_{p^α,σ} [ sup_{(B_0/2) P^α[(f - r^α)^2] ≤ ρ, f ∈ F_G} (1/(nN)) Σ_{v,i} σ_{v,i} f_v(y_{v,i}) ]
  = B_0 (b + C_α) E_{p^α,σ} [ sup_{(B_0/2) P^α[(f - r^α)^2] ≤ ρ, f ∈ F_G} (1/(nN)) Σ_{v,i} σ_{v,i} (f_v(y_{v,i}) - r^α_v(y_{v,i})) ]   (by the independence of σ and Y)
  ≤ B_0 (b + C_α) E_{p^α,σ} [ sup_{(B_0/2) P^α[(f - g)^2] ≤ ρ, f, g ∈ F_G} (1/(nN)) Σ_{v,i} σ_{v,i} (f_v(y_{v,i}) - g_v(y_{v,i})) ]
  = 2 B_0 (b + C_α) R(F_G, ρ/(2B_0)).

In the last inequality we have used the symmetry of the Rademacher variables and the fact that F_G is symmetric and convex. Lemma 9 implies that 2B_0 (b + C_α) R(F_G, ρ/(2B_0)) is a sub-root function. The goal now is to upper-bound its fixed point ρ*. This requires exploiting the properties of the graph regularization and the capacity condition associated with the covariance operators {Σ_v}_{v∈V}. A big part of this analysis has been done in Yousefi et al. (2018).
We rewrite their most relevant results and rework the upper-bounds to obtain clearer expressions for our problem.

Theorem 15 (Theorem 11 in Yousefi et al. (2018)) Let the regularizer be ||f||^2_G as defined in Eq. 16, and denote its dual norm by ||·||_*. Let the kernels be uniformly bounded, and define the sample Y = {Y_v}_{v∈V} = {y_{v,1}, ..., y_{v,n}}_{v∈V}, where, for each v ∈ V, {y_{v,1}, ..., y_{v,n}} is an i.i.d. sample drawn from p^α_v. Assume that for each v ∈ V the associated covariance operator admits an eigendecomposition Σ_v = E_{p^α_v(y)}[φ(y) ⊗ φ(y)] = Σ_{i∈N} μ_{v,i} φ̃_{v,i} ⊗ φ̃_{v,i}, where {φ̃_{v,i}}_{v∈V} forms an orthonormal basis of H and {μ_{v,i}}_{i≥1} are the corresponding eigenvalues in non-increasing order. Then, for any given positive operator D on R^N, any ρ > 0, and any non-negative integers h_1, ..., h_N:

R(F_G, ρ) ≤ sqrt( ρ Σ_{v∈V} h_v / (nN) ) + (2Λ/N) E_{p^α,σ}[ ||D^{-1/2} V||_* ],   (80)

where V = { Σ_{j>h_v} < (1/n) Σ_{i=1}^n σ_{v,i} φ(y_{v,i}), φ̃_{v,j} >_H φ̃_{v,j} }_{v∈V}.

Lemma 16 (Expr. C.9 of Corollary 22 in Yousefi et al. (2018)) Under the hypotheses of the previous theorem, we have the following inequality:

E_{p^α,σ}[ ||D^{-1/2} V|| ] ≤ sqrt( (1/n) Σ_{v∈V} D^{-1}_{vv} Σ_{j>h_v} μ_{v,j} ),   (81)

where {D^{-1}_{vv}}_{v∈V} are the diagonal elements of D^{-1}.

Lemma 17 If Assumptions 2-3 are satisfied and F_G is a class of functions ranging in [-b, b], b ∈ R_+, then 2B_0 (b + C_α) R(F_G, ρ/(2B_0)) is a sub-root function whose fixed point ρ* satisfies, up to a universal constant, the upper-bound:

ρ* ≲ ( T_G Λ^2 (b + C_α)^2 ζ )^{1/(1+ζ)} n^{-ζ/(1+ζ)} N^{-1/(1+ζ)} s_max^{1/(1+ζ)},   (82)

where

T_G = β γ^{-1} + (1 - β) λ^{-1}_{min+}   (83)

encodes the graph topology, ζ = min_{v∈V} ζ_v, s_max = max_{v∈V} s_v, β = #C(G)/N ∈ [0,1] is the ratio between the number of connected components of G and the number of nodes N, λ_min+ denotes the lowest nonzero eigenvalue of L, and, by convention, we set (1 - β) λ^{-1}_{min+} = 0 when #C(G) = N.
Proof Combining Theorem 15 and Lemma 16 leads us to the following inequality:
\[
2B_0(b+C_\alpha)R\Big(\mathcal{F}_G,\tfrac{\rho}{2B_0}\Big) \le 2B_0(b+C_\alpha)\Bigg[\sqrt{\frac{\rho\sum_{v\in V}h_v}{2B_0\,nN}} + \sqrt{\frac{2\Lambda^2}{nN^2}\sum_{v\in V}D^{-1}_{vv}\sum_{j>h_v}\mu_{v,j}}\Bigg]
\]
\[
= \sqrt{\frac{2B_0(b+C_\alpha)^2\,\rho\sum_{v\in V}h_v}{nN}} + 2B_0(b+C_\alpha)\sqrt{\frac{2\Lambda^2}{nN^2}\sum_{v\in V}D^{-1}_{vv}\sum_{j>h_v}\mu_{v,j}} = \sqrt{a_1\rho}+a_2, \tag{84}
\]
where in the last expression we have introduced the variables:
\[
a_1 = \frac{2B_0(b+C_\alpha)^2\sum_{v\in V}h_v}{nN}, \qquad a_2 = 2B_0(b+C_\alpha)\sqrt{\frac{2\Lambda^2}{nN^2}\sum_{v\in V}D^{-1}_{vv}\sum_{j>h_v}\mu_{v,j}}. \tag{85}
\]
Now, we look for the solution to the equation $\sqrt{a_1\rho}+a_2=\rho$, which is equivalent to solving $\rho^2-(a_1+2a_2)\rho+a_2^2=0$, that is,
\[
\rho = \frac{(a_1+2a_2)+\sqrt{a_1^2+4a_1a_2}}{2} \le a_1+2a_2. \tag{86}
\]
As $\rho^*$ is the fixed point of $2B_0(b+C_\alpha)R(\mathcal{F}_G,\tfrac{\rho}{2B_0})$, then by Lemma 8 we have:
\[
\rho^* \le a_1+2a_2. \tag{87}
\]
The goal now is to upper-bound both the terms $a_1$ and $a_2$ by exploiting Assumption 4. Observe that, by the capacity condition (Assumption 4), we have:
\[
\sum_{j>h_v}\mu_{v,j} \le \sum_{j>h_v} s_v\,j^{-\zeta_v} \le s_v\int_{h_v}^{\infty}x^{-\zeta_v}\,dx = \frac{s_v}{\zeta_v-1}\,h_v^{1-\zeta_v},
\]
which implies:
\[
a_2 \le 2B_0(b+C_\alpha)\sqrt{\frac{2\Lambda^2}{nN^2}\sum_{v\in V}D^{-1}_{vv}\,\frac{s_v}{\zeta_v-1}\,h_v^{1-\zeta_v}}.
\]
Moreover, by the Cauchy-Schwarz inequality we have:
\[
a_1 = \frac{2B_0(b+C_\alpha)^2\sum_{v\in V}h_v}{nN} \le \frac{2B_0(b+C_\alpha)^2\sqrt{N}\sqrt{\sum_{v\in V}h_v^2}}{nN} = 2B_0(b+C_\alpha)^2\sqrt{\frac{\sum_{v\in V}h_v^2}{n^2N}}.
\]
After putting together both inequalities, we get:
\[
\rho^* \le a_1+2a_2 \le 2B_0(b+C_\alpha)\sqrt{\sum_{v\in V}\big(c\,h_v^2 + c_v\,h_v^{1-\zeta_v}\big)},
\]
where
\[
c = \frac{2(b+C_\alpha)^2}{n^2N}, \qquad c_v = \frac{16\Lambda^2 D^{-1}_{vv}\, s_v}{(\zeta_v-1)\,nN^2}.
\]
Taking the partial derivative w.r.t. $h_v$ and setting it to zero yields the optimal value:
\[
h_v^* = \Big(\frac{(\zeta_v-1)\,c_v}{2c}\Big)^{\frac{1}{1+\zeta_v}} = \Big(\frac{4\Lambda^2 D^{-1}_{vv}\, s_v\, n}{(b+C_\alpha)^2 N}\Big)^{\frac{1}{1+\zeta_v}}. \tag{88}
\]
Then, after substitution (at the optimum, $c_v(h_v^*)^{1-\zeta_v} = \tfrac{2c}{\zeta_v-1}(h_v^*)^2$):
\[
\rho^* \le 2B_0(b+C_\alpha)\sqrt{\sum_{v\in V}\big(c\,(h_v^*)^2+c_v\,(h_v^*)^{1-\zeta_v}\big)} = 2B_0(b+C_\alpha)\sqrt{\sum_{v\in V}c\,(h_v^*)^2\,\frac{\zeta_v+1}{\zeta_v-1}}.
\]
If we define $\zeta=\min_{v\in V}\zeta_v$ and $s_{\max}=\max_{v\in V}s_v$, we get:
\[
\rho^* \le C\,\big[\Lambda^2(b+C_\alpha)^2\,\zeta\big]^{\frac{1}{1+\zeta}}\, n^{-\frac{\zeta}{1+\zeta}}\, N^{-\frac{1}{1+\zeta}}\, s_{\max}^{\frac{1}{1+\zeta}}\,\Bigg(\frac{1}{N}\sum_{v\in V}\big(D^{-1}_{vv}\big)^{\frac{2}{1+\zeta}}\Bigg)^{\frac12},
\]
for a constant $C$. Let us consider the last term. Since $D^{-1}=(L+\gamma I)^{-1}$ is positive definite, its diagonal elements $D^{-1}_{vv}$ are strictly positive.
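Three computational steps of this proof admit a quick numerical sanity check: the fixed-point formula in Expr. 86, the optimality of $h_v^*$ in Expr. 88, and the bound of $\frac{1}{N}\mathrm{tr}((L+\gamma I)^{-1})$ by $\beta\gamma^{-1}+(1-\beta)\lambda^{-1}_{\min+}$ (the quantity $T_G$ of Lemma 17). The constants and the toy graph below are arbitrary illustrations, not values from the paper:

```python
import math

# (i) Fixed point of psi(rho) = sqrt(a1*rho) + a2: the largest root of
#     rho^2 - (a1 + 2*a2)*rho + a2^2 = 0, which is at most a1 + 2*a2.
a1, a2 = 0.7, 0.3
rho_star = ((a1 + 2 * a2) + math.sqrt(a1**2 + 4 * a1 * a2)) / 2
assert abs(math.sqrt(a1 * rho_star) + a2 - rho_star) < 1e-12  # psi(rho*) = rho*
assert rho_star <= a1 + 2 * a2                                # Expr. 87

# (ii) h* = ((zeta-1)*c_v / (2c))^(1/(1+zeta)) minimizes c*h^2 + c_v*h^(1-zeta) over h > 0.
c, c_v, zeta = 0.2, 1.5, 2.0
h_star = ((zeta - 1) * c_v / (2 * c)) ** (1 / (1 + zeta))
obj = lambda h: c * h**2 + c_v * h ** (1 - zeta)
assert all(obj(h_star) <= obj(h_star * t) for t in (0.5, 0.9, 1.1, 2.0))

# (iii) (1/N) tr((L + gamma*I)^{-1}) <= beta/gamma + (1-beta)/lambda_min+, on a toy
#       graph: a path on 3 nodes plus one isolated node; Laplacian spectrum {0,1,3} U {0}.
eigs = [0.0, 1.0, 3.0, 0.0]
N, gamma = len(eigs), 0.5
beta = sum(1 for lam in eigs if lam == 0.0) / N       # multiplicity of 0 is #C(G)
lam_min_plus = min(lam for lam in eigs if lam > 0.0)  # smallest nonzero eigenvalue
lhs = sum(1.0 / (lam + gamma) for lam in eigs) / N    # (1/N) tr((L + gamma I)^{-1})
rhs = beta / gamma + (1 - beta) / lam_min_plus        # this is T_G
assert lhs <= rhs
print("all checks passed")
```

Check (iii) works eigenvalue by eigenvalue: $1/(\lambda_i+\gamma)\le 1/\gamma$ when $\lambda_i=0$, and $1/(\lambda_i+\gamma)\le 1/\lambda_{\min+}$ otherwise, which is exactly the case split used below.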
In addition, $g(x)=x^{\frac{2}{1+\zeta}}$ is a concave function for $x\ge 0$ and $\zeta\ge 1$; then, by applying Jensen's inequality, it follows that:
\[
\frac{1}{N}\sum_{v\in V}\big(D^{-1}_{vv}\big)^{\frac{2}{1+\zeta}} \le \Bigg(\frac{1}{N}\sum_{v\in V}D^{-1}_{vv}\Bigg)^{\frac{2}{1+\zeta}} = \Big(\frac{1}{N}\,\mathrm{tr}\big((L+\gamma I)^{-1}\big)\Big)^{\frac{2}{1+\zeta}} = \Bigg(\frac{1}{N}\sum_{i=1}^{N}\frac{1}{\lambda_i+\gamma}\Bigg)^{\frac{2}{1+\zeta}}, \tag{91}
\]
where $\{\lambda_1,\dots,\lambda_N\}$ are the eigenvalues of the Laplacian matrix $L$ in increasing order. Our hypothesis on the graph $G$ and the properties of $L$ imply $\lambda_i\ge 0$ for all $i\in\{1,\dots,N\}$, and $\lambda_1=0$. Moreover, the multiplicity of $\lambda_1$ is equal to the number of connected components of $G$. Then, we can upper-bound the last term of Expr. 91 as follows:
\[
\frac{1}{N}\sum_{i=1}^{N}\frac{1}{\lambda_i+\gamma} \le
\begin{cases}
\dfrac{\#C(G)}{N}\,\dfrac{1}{\gamma}+\dfrac{N-\#C(G)}{N}\,\dfrac{1}{\lambda_{\min+}} & \text{if } \#C(G)<N,\\[6pt]
\dfrac{1}{\gamma} & \text{if } \#C(G)=N,
\end{cases}
\;=\; \beta\gamma^{-1}+(1-\beta)\lambda^{-1}_{\min+},
\]
where $\#C(G)$ is the number of connected components of $G$, and $\lambda_{\min+}$ is the smallest nonzero eigenvalue of the Laplacian. In the last equality, we define $\beta=\frac{\#C(G)}{N}\in[0,1]$ and use the convention that $(1-\beta)\lambda^{-1}_{\min+}=0$ when $\#C(G)=N$, which corresponds to a completely disconnected graph. With these elements, we can conclude:
\[
\rho^* \le C\,\big[T_G\,\Lambda^2(b+C_\alpha)^2\,\zeta\big]^{\frac{1}{1+\zeta}}\, n^{-\frac{\zeta}{1+\zeta}}\, N^{-\frac{1}{1+\zeta}}\, s_{\max}^{\frac{1}{1+\zeta}}.
\]

C.3 Proof of Theorem 3

Proof Lemma 14 implies that $\mathcal{H}_G$ is a $(\beta,B)$-Bernstein class of vector-valued functions with $\beta=1$ and $B=B_0$, and $\max_{v\in V}\sup_{(x,x')\in\mathcal{X}^2}|h_{f_v}(x,x')|\le B_1$. By Lemma 17, there exists a sub-root function $\varrho(\rho)$ such that $B_0R(\mathcal{H}_G,\rho)\le\varrho(\rho)$. Then, the hypotheses of Theorem 11 are satisfied, implying that, with probability at least $1-\delta$, every $f\in\mathcal{F}_G$ satisfies:
\[
\sum_{v\in V}\mathbb{E}_{p_{z,v}}\big[\ell_v(f_v)(z)-\ell_v(r^\alpha_v)(z)\big] \le \frac{1}{n}\sum_{v\in V}\sum_{i=1}^{n}\Big[\frac{1-\alpha}{2}\big(f_v^2-(r^\alpha_v)^2\big)(x_{v,i})+\frac{\alpha}{2}\big(f_v^2-(r^\alpha_v)^2\big)(x'_{v,i})-[f_v-r^\alpha_v](x'_{v,i})\Big] + 2CB_1\rho^* + \frac{16B_0^2C}{nN}\log\frac{1}{\delta}. \tag{92}
\]
In particular, for the minimizer $\hat f$ of Problem 34, the term involving the empirical expectations is at most zero. As detailed in Appendix A.1, we can easily verify that $\mathrm{PE}(p^\alpha_v\,\|\,q_v) = -\mathbb{E}_{p_{z,v}}[\ell_v(r^\alpha_v)(z)]-\frac12$, and by Expr. 32, $\mathrm{PE}^\alpha_v(f_v) = -\mathbb{E}_{p_{z,v}}[\ell_v(f_v)(z)]-\frac12$. Then for $\hat f$ we can rewrite Expr.
92 as:
\[
\sum_{v\in V}\Big[\mathrm{PE}(p^\alpha_v\,\|\,q_v)-\mathrm{PE}^\alpha_v(\hat f_v)\Big] \le 2CB_1\rho^* + \frac{16B_0^2C}{nN}\log\frac{1}{\delta}.
\]
Alternatively, after applying the second point of Lemma 12, we can conclude:
\[
\sum_{v\in V}\mathbb{E}_{p^\alpha_v(y)}\big[\hat f_v-r^\alpha_v\big]^2(y) \le 4CB_1\rho^* + \frac{32B_0^2C}{nN}\log\frac{1}{\delta},
\]
where Lemma 17 implies:
\[
\rho^* \le C'\,\big[T_G\,\Lambda^2(b+C_\alpha)^2\,\zeta\big]^{\frac{1}{1+\zeta}}\, n^{-\frac{\zeta}{1+\zeta}}\, N^{-\frac{1}{1+\zeta}}\, s_{\max}^{\frac{1}{1+\zeta}}.
\]

References

Rohit Agrawal and Thibaut Horel. Optimal bounds between f-divergences and integral probability metrics. Journal of Machine Learning Research, 22(128):1-59, 2021.
Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195-266, 2012.
N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404, 1950.
Francis Bach. Sum-of-Squares Relaxations for Information Theory and Variational Inference. Foundations of Computational Mathematics, 2024.
Francis Bach and Michael Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3(Jul):1-48, 2002.
Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497-1537, 2005.
Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23:2037-2060, 2013.
Moritz Beyreuther, Robert Barsch, Lion Krischer, Tobias Megies, Yannik Behr, and Joachim Wassermann. ObsPy: A Python toolbox for seismology. Seismological Research Letters, 81(3):530-533, 2010.
Jeremiah Birrell, Paul Dupuis, Markos A. Katsoulakis, Yannis Pantazis, and Luc Rey-Bellet. (f, γ)-divergences: Interpolating between f-divergences and integral probability metrics. Journal of Machine Learning Research, 23(39):1-70, 2022a.
Jeremiah Birrell, Markos A. Katsoulakis, and Yannis Pantazis. Optimizing variational representations of divergences and accelerating their statistical estimation. IEEE Transactions on Information Theory, 68(7):4553-4572, 2022b.
Michel Broniatowski and Amor Keziou. Minimization of φ-divergences on sets of signed measures. Studia Scientiarum Mathematicarum Hungarica, 43(4):403-442, 2006.
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2006.
Claudio Carmeli, Ernesto De Vito, and Alessandro Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(4):377-408, 2006.
Imre Csiszár. On topological properties of f-divergences. Studia Scientiarum Mathematicarum Hungarica, 2:329-339, 1967.
Alejandro de la Concha, Nicolas Vayatis, and Argyris Kalogeratos. Online centralized non-parametric change-point detection via graph-based likelihood-ratio estimation. arXiv:2301.03011, 2023.
Alejandro de la Concha, Nicolas Vayatis, and Argyris Kalogeratos. Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 1189-1197, 2024.
Alejandro de la Concha, Nicolas Vayatis, and Argyris Kalogeratos. Collaborative non-parametric two-sample testing. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2025.
Aymeric Dieuleveut. Stochastic approximation in Hilbert spaces. PhD thesis, Université Paris Sciences et Lettres, 2017.
André Ferrari, Cédric Richard, Anthony Bourrier, and Ikram Bouchikhi. Online change-point detection with kernels. Pattern Recognition, 133:109022, 2023.
GNS Science. GeoNet Aotearoa New Zealand Earthquake Catalogue, 1970.
Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, 2006.
Takafumi Kanamori, Taiji Suzuki, and Masashi Sugiyama.
Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3):335-367, 2011.
Masahiro Kato and Takeshi Teshima. Non-negative Bregman divergence minimization for deep direct density ratio estimation. In Proceedings of the International Conference on Machine Learning, pages 5320-5333, 2021.
Solomon Kullback. Information Theory and Statistics. Wiley, 1959.
Erich L. Lehmann and Joseph P. Romano. Testing Statistical Hypotheses. Springer Nature, Cham, Switzerland, 4th edition, 2022.
Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Mingyi Hong. On faster convergence of cyclic block coordinate descent-type methods for strongly convex minimization. Journal of Machine Learning Research, 18(184):1-24, 2018.
Song Liu, Makoto Yamada, Nigel Collier, and Masashi Sugiyama. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72-83, 2013.
Nan Lu, Tianyi Zhang, Tongtong Fang, Takeshi Teshima, and Masashi Sugiyama. Rethinking Importance Weighting for Transfer Learning, pages 185-231. Springer, 2023.
Andreas Maurer. The Rademacher complexity of linear transformation classes. In Learning Theory, pages 65-78. Springer, 2006a.
Andreas Maurer. The Rademacher complexity of linear transformation classes. In Gábor Lugosi and Hans Ulrich Simon, editors, Learning Theory, pages 65-78. Springer, 2006b.
Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177-204, 2005.
Roula Nassif, Stefan Vlaski, Cédric Richard, Jie Chen, and Ali H. Sayed. Multitask learning over graphs: An approach for distributed, streaming machine learning. IEEE Signal Processing Magazine, 37(3):14-25, 2020a.
Roula Nassif, Stefan Vlaski, Cédric Richard, and Ali H. Sayed. Learning over multitask graphs, part I: Stability analysis. IEEE Open Journal of Signal Processing, 1:28-45, 2020b.
Roula Nassif, Stefan Vlaski, Cédric Richard, and Ali H. Sayed. Learning over multitask graphs, part II: Performance analysis. IEEE Open Journal of Signal Processing, 1:46-63, 2020c.
Duc Hoan Nguyen, Werner Zellinger, and Sergei Pereverzyev. On regularized Radon-Nikodym differentiation. Journal of Machine Learning Research, 25(266):1-24, 2024.
XuanLong Nguyen, Martin J. Wainwright, and Michael Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems, 2008.
XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, 2016.
Karl Pearson. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157-175, 1900.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007.
Benjamin Rhodes, Kai Xu, and Michael U. Gutmann. Telescoping density-ratio estimation. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, 2020.
Cédric Richard, José Carlos M. Bermudez, and Paul Honeine. Online prediction of time series data with kernels. IEEE Transactions on Signal Processing, 57(3):1058-1067, 2009.
Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
Daniel Sheldon. Graphical Multi-Task Learning. Technical report, Cornell University, 2008.
David I. Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83-98, 2013.
Alex J. Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. In International Conference on Machine Learning, pages 911-918, 2000.
Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(70):2389-2410, 2011. URL http://jmlr.org/papers/v12/sriperumbudur11a.html.
Ingo Steinwart, Don R. Hush, Clint Scovel, et al. Optimal rates for regularized least squares regression. In Conference on Learning Theory, pages 79-93, 2009.
Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, 2007.
Masashi Sugiyama, Taiji Suzuki, Yuta Itoh, Takafumi Kanamori, and Manabu Kimura. Least-squares two-sample test. Neural Networks, 24:735-751, 2011a.
Masashi Sugiyama, Taiji Suzuki, Yuta Itoh, Takafumi Kanamori, and Manabu Kimura. Least-squares two-sample test. Neural Networks, 24(7):735-751, 2011b.
Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.
Ameet Talwalkar, Sanjiv Kumar, and Henry Rowley. Large-scale manifold learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
Alexander Tartakovsky, Igor V. Nikiforov, and Michele Basseville. Sequential Analysis: Hypothesis Testing and Changepoint Detection. Taylor & Francis, CRC Press, 2014.
Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2000.
Stephen J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3-34, 2015.
Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. In Advances in Neural Information Processing Systems, 2011.
Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25(5):1324-1370, 2013.
Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms. Foundations of Computational Mathematics, 8(5):561-596, 2007.
Niloofar Yousefi, Yunwen Lei, Marius Kloft, Mansooreh Mollaghasemi, and Georgios C. Anagnostopoulos. Local Rademacher complexity-based learning guarantees for multi-task learning. Journal of Machine Learning Research, 19(38):1-47, 2018.
Werner Zellinger, Stefan Kindermann, and Sergei V. Pereverzyev. Adaptive learning of density ratios in RKHS. Journal of Machine Learning Research, 24(395):1-28, 2023.
Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In International Conference on Machine Learning, pages 1232-1239, 2008.
Yu Zhang and Qiang Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 2021.