# Distributed Kernel-Driven Data Clustering

Journal of Machine Learning Research 25 (2024) 1-39. Submitted 5/23; Revised 10/24; Published 11/24.

Ioannis Schizas, IOANNIS.D.SCHIZAS.CIV@ARMY.MIL
US Army Combat Capabilities Development Command, Army Research Lab, Aberdeen Proving Ground, MD 21005, USA

Editor: Dan Alistarh

© 2024 I. Schizas. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0669.html.

Abstract

A novel fully distributed joint kernel learning and clustering framework is derived that is capable of determining clustering configurations in an unsupervised manner. Semidefinite programming is utilized to quantify the closeness of candidate kernel similarity matrices to a block diagonal structure of certain rank. Utilizing difference of convex functions and block coordinate descent, a recursive algorithm is derived that jointly determines a proper kernel similarity matrix and clustering factors. Reformulating the involved semidefinite programs in a separable way, we build on the alternating direction method of multipliers to construct a fully distributed scheme that enables joint kernel learning and clustering in ad hoc networks via collaborating neighboring agents. Convergence claims establish that the proposed algorithmic framework returns bounded similarity kernel updates promoting a block diagonal structure. Detailed numerical examples utilizing both synthetic and real data demonstrate that the novel distributed approach can achieve clustering performance that approaches or even exceeds that of existing centralized alternatives.

Keywords: Distributed learning, kernels, clustering, unsupervised learning, optimization

1. Introduction

Clustering data vectors into different groups sharing similar properties has been extensively studied as an unsupervised learning technique when data labels are not available. An essential aspect in data clustering is picking a proper similarity metric. For instance, in K-means (Li and Guo, 2018; Lloyd, 1982; Oliva et al., 2013), a workhorse in data clustering, each cluster is represented by a centroid point and the data are assigned to the cluster whose centroid is closest with respect to a pre-selected distance metric. Different variants of matrix factorization have also been used to explore the clustering problem given a known data similarity matrix (Cai et al., 2010; Huang et al., 2013; Trigeorgis et al., 2016; Wang and Zhang, 2012). To deal with non-linear settings, kernel-based methods have been proposed that are capable of unveiling data vector correlations in higher-dimensional spaces. Kernel target alignment (Cortes et al., 2012; Cristianini et al., 2001; Müller et al., 2018) is one such popular supervised approach, wherein suitable kernel matrices are identified by computing the alignment, i.e., the normalized inner product, between the kernel correlation matrices and the correlation of the class labels. Other kernel learning techniques for classification and regression problems rely on convex optimization (Ghari and Shen, 2020; Hoi et al., 2013; Jin et al., 2010; Motai, 2014), though they are still supervised or semi-supervised in nature.

Unsupervised approaches have been recently proposed to jointly construct a proper kernel similarity matrix while performing data clustering. The unsupervised approach in Ren and Sun (2020)
relies on building proper graph similarity matrices to model correlations among the data vectors via an affine weight strategy while preserving the data structure via proper constraints. Further, relying on tensor factorization the method in (Ren et al., 2020) offers a more computationally expensive option while improving clustering in a variety of different datasets. Another unsupervised joint kernel learning and clustering approach in (Malhotra and Schizas, 2020) utilizes a sparsity regularized non-negative matrix factorization along with eigenvalue maximization to find proper kernel covariance matrices that facilitate data clustering. All aforementioned approaches are centralized in the sense of requiring a central processor to operate or multiple processing units in a tree formation (master-slave). Therefore they are not operational in ad hoc architectures involving multiple sensing/processing agents/units. For instance, distributed sensing units could correspond to accelerometers (e.g., smartphones) mounted on humans that engage in different activities, e.g., running, walking, jumping (Micucci et al., 2017) and so on based on the situation they are facing. Clustering the sensing units based on the activity they are monitoring is essential in facilitating situational awareness in large gatherings or tactical applications assessing how different groups of people behave. Distributed clustering in such situations is desirable to enable scaling as the number of agents increases. Distributed data clustering techniques relying on splitting centralized K-means in localized processing tasks have been proposed (Chen et al., 2016; Oliva et al., 2013; Qin et al., 2016; Tsapanos et al., 2015) with some of them requiring a fusion center (master processing unit) (Chen et al., 2016; Li and Guo, 2018; Tsapanos et al., 2015). These approaches, as the centralized K-means approach rely on a preset data similarity metric which can considerably limit the clustering performance. On the other end, existing distributed kernel learning techniques rely on distributed optimization techniques to derive localized learning tasks and perform supervised learning tasks, i.e., regression and classification, (Bouboulis et al., 2017; Hong and Chae, 2021; Shin et al., 2018). The main goal of this work is to derive a fully distributed algorithmic framework for joint kernel learning and clustering in an unsupervised manner. Existing approaches in clustering and kernel learning as mentioned earlier are centralized or rely on master-slave multiprocessing architectures (Malhotra and Schizas, 2020; Ren and Sun, 2020; Ren et al., 2020). Such approaches are not applicable in ad hoc multi-agent architectures where there is no central processing unit. Building on semidefinite programming (SDP) (Boyd and Vandenberghe, 2004), difference of convex functions (Tao and An, 1997) and block coordinate descent strategies (Tseng, 2001) we first obtain a centralized joint kernel learning and clustering formulation. Different from existing approaches, the novel minimization formulation is amenable to a separable reformulation which will enable the derivation of local learning tasks across the sensing units. 
Despite the fact that a reformulation is needed to obtain a separable, non-equivalent formulation from the centralized one, due to the single-hop connectivity in the network of sensing units, it is established that the centralized and distributed optimal solutions share a block diagonal structure and equal rank, which is crucial for the clustering task. The proposed distributed and centralized formulations are not equivalent from an optimization point of view, but their optimal solutions share a block diagonal structure. Specifically, the alternating direction method of multipliers (ADMM) is employed to split the involved SDP programs into a set of local SDP tasks that can be tackled using local information at every sensing agent. A novel combination of ADMM, SDP programming and block coordinate descent is employed to devise a fully distributed algorithm that facilitates proper kernel learning and effective unsupervised clustering. Our work contributions are summarized as follows:

1. Derivation of an SDP-based framework quantifying closeness of candidate kernel mappings to a block diagonal structure that facilitates joint kernel learning and factorization-based clustering.

2. A novel minimization formulation that combines block coordinate descent and semidefinite programming, resulting in a novel centralized joint kernel learning and clustering approach.

3. Derivation of pertinent localized SDP problems, where the alternating direction method of multipliers is employed in the separable kernel learning and clustering formulation, leading to a fully distributed kernel learning and clustering approach that can be utilized in ad hoc multi-agent networks. It turns out that the communication costs among neighboring agents are linear in the number of clusters, the size of the neighborhood and the size of the kernel dictionary.

4. Convergence of the novel framework is established, showing that the kernel iterates move towards a block diagonal structure facilitating data clustering.

5. Extensive numerical tests demonstrate the clustering accuracy advantage of the novel framework over existing centralized alternatives.

The problem setting and preliminary concepts are provided in Sec. 2, while explaining the idea of data clustering via a proper block diagonal kernel covariance matrix construction using an ad hoc multi-agent network. In Sec. 3 the novel SDP-based kernel learning and clustering formulation is derived, and a centralized algorithm is provided using block coordinate descent and difference of convex functions, with convergence to a finite bound being established. Sec. 4 builds a separable SDP-based formulation which is further tackled via ADMM, resulting in localized kernel learning and clustering tasks that can be addressed locally at each agent, which via collaboration with neighboring agents can tackle the global minimization formulation. Convergence analysis advocates that the novel algorithm learns a kernel data similarity matrix approaching a block diagonal structure that has the potential to reach the clustering performance of its centralized counterpart. Extensive numerical tests are performed in Sec. 5 for the novel algorithm along with state-of-the-art clustering methods using synthetic and real datasets, evaluating clustering accuracy, normalized mutual information (NMI) and purity.
2. Problem Setting and Preliminaries

Consider P agents spatially scattered across a field, with the ith unit acquiring signal measurements $x_i(t)$ over a time horizon $t \in [0, T-1]$, i.e.,

$$x_i(t) = f_{q(i)}\big(s_{q(i)}(t)\big) + w_i(t), \quad i = 1, \dots, P, \qquad (1)$$

where $f_{q(i)}(\cdot)$ represents a mapping between an underlying source $s_{q(i)}(t)$ and the measurement $x_i(t)$, while $w_i(t)$ denotes spatially uncorrelated sensing noise.

Sensing assumption: It is assumed that among the Q underlying sources, each unit observes one of them, i.e., unit i observes source q(i), where $q(i) \in \{1, \dots, Q\}$ is an underlying unknown mapping matching sources with observations. Each of the vectors $x_i$ contains information about a specific source, namely $q(i) \in \{1, \dots, Q\}$, and thus belongs to a specific cluster. Further, the mapping function $f_q(\cdot)$ is not available and is nonlinear. Each agent groups its measurements in the vector $x_i := [x_i(0), \dots, x_i(T-1)]^T$, with T denoting the number of samples.

For the example given in the Introduction where different sensing units monitor the physical activity a person is engaging in (running, walking and so on), it is reasonable to assume that there is one dominant activity that will be measured by the accelerometers. Similarly, in hyperspectral imaging data (Sal, 2021) (tested in Numerical Simulations) the majority of pixels acquire information about multiple materials but only one has a dominant effect on the pixel value. Thus, the assumption that each sensing unit detects one source can hold approximately in several scenarios. The sensing noise corresponds to thermal noise that is caused locally at each sensing unit by the sensing electronics and is not correlated with the source signals. Sensing units are different and placed in different spatial locations. Therefore, the thermal sensing noise caused by the individual sensing electronics at each unit can safely be assumed to be uncorrelated across different units, see e.g., (Peng et al., 2020).

For simplicity in exposition, we temporarily consider an affine mapping $f_q(s) = \alpha_q s + \beta_q$ and uncorrelated sources $\{s_q(t)\}_{q=1}^{Q}$. Then the data covariance matrix $\Sigma_x := E[(X_t - \bar{X})(X_t - \bar{X})^T]$, with $X_t := [x_1(t), \dots, x_P(t)]^T$ and $\bar{X} := E[X_t]$, has entries (E denotes the expectation operator)

$$[\Sigma_x]_{i,i'} = E\big[(x_i(t) - \bar{x}_i)(x_{i'}(t) - \bar{x}_{i'})\big] = \alpha_q \alpha_{q'}\,\delta(q - q')\,\sigma^2_{s_q} + \sigma^2_w\,\delta(i - i'), \qquad (2)$$

where $s_q(t)$ and $s_{q'}(t)$ denote the source signals sensed in measurements $x_i(t)$ and $x_{i'}(t)$ at sensing units i and i', $\sigma^2_w$ and $\sigma^2_{s_q}$ refer to the variances of $w_i(t)$ and $s_q(t)$ respectively, while $\delta(q - q')$ equals 1 if $q = q'$ and zero otherwise. Notice that if agents i and i' sense correlated source signals then the corresponding covariance entry in (2) will be nonzero, otherwise it will be equal to zero. It can be concluded that sensing units observing the same source signal $s_q(t)$ are correlated; hence, by applying proper row and column exchanges to the covariance matrix $\Sigma_x$ we can transform it into a block diagonal matrix having Q diagonal blocks, with each block clustering together the sensing units observing the same source signal. It should be emphasized that affine mappings are not required by the novel framework; they just serve as an illustrative example to show the importance of a block diagonal structure and how diagonal blocks can be mapped to different groups of measurements, with each group sensing the same source.
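As a concrete illustration of this linear/affine special case, the short sketch below simulates P units observing Q uncorrelated sources through affine maps with low-variance noise, estimates the covariance in (2) from samples, and permutes rows and columns by the (here known) source assignment so that the Q diagonal blocks become visible; all sizes and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
P, Q, T = 12, 3, 2000                      # units, sources, samples (hypothetical sizes)
q = rng.integers(0, Q, size=P)             # unknown unit-to-source assignment q(i)
alpha = rng.uniform(0.5, 2.0, size=Q)      # affine slopes alpha_q per source
beta = rng.uniform(-1.0, 1.0, size=Q)      # affine offsets beta_q (do not affect the covariance)
sigma_w = 0.05                             # low sensing-noise std (noise assumption)

S = rng.standard_normal((Q, T))            # uncorrelated zero-mean sources s_q(t)
X = alpha[q][:, None] * S[q, :] + beta[q][:, None] + sigma_w * rng.standard_normal((P, T))

Sigma_x = np.cov(X)                        # sample estimate of the covariance in (2)

# Permuting units by their source index reveals Q diagonal blocks of rank ~1.
perm = np.argsort(q)
print(np.round(Sigma_x[np.ix_(perm, perm)], 2))
```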
Clearly, the block diagonal structure of the data covariance matrix $\Sigma_x$ can be used to cluster the sensed data according to their source content by using the indices of the diagonal blocks. Fig. 1 (left, lower center) indicates how measurements observing the same source translate to a diagonal block in a properly defined similarity matrix (in the linear case $K_x = \Sigma_x$). Each of the covariance diagonal blocks can be considered to be of rank one, since sensed data from the same source-group (diagonal block) depend on the same source signal, while assuming that the sensing noise has sufficiently low variance.

Noise assumption: Essentially the noise variance $\sigma^2_w$ should be much smaller than the variances $\sigma^2_{s_q}$ associated with the sources $s_q(t)$ for $q = 1, \dots, Q$. This way each of the diagonal blocks is mainly affected by the source signal, while the noise contribution is negligible, resulting in diagonal blocks of rank approximately equal to 1. There are several noise reduction techniques in signal processing, including adaptive filtering and Bayesian inference, that can effectively suppress noise variance and enhance the signal components of the sensed data, see e.g., Vaseghi (2008).

Thus, in the presence of Q sources and an affine model in (1), the covariance matrix will contain Q rank-1 diagonal blocks. A block diagonal similarity matrix will further facilitate data clustering by pertinent sparse factorization of $K_x$, where the nonzero entries of the recovered sparse factors M, N [colored entries in Fig. 1 (right)] will reveal the location of the diagonal blocks and therefore the indices of units sensing the same source.

When the unknown mapping functions $f_q(\cdot)$ in (1) are non-linear and the source signals are not uncorrelated, then the resulting covariance matrix in (2) may not necessarily give rise to a block diagonal structure. To bypass these challenges, we will build on kernel-based data transformations to construct a pertinent kernel covariance matrix $K_x$ which is as close as possible to i) having a block diagonal structure; and ii) having rank Q. Note that the covariance matrix $\Sigma_x$ in (2) is a special case of a kernel covariance matrix when using linear kernels, i.e., $K_x = \Sigma_x$. Depending on the sensed data, different kernel functions should be used to give rise to a block diagonal structure of rank Q that will facilitate data clustering. We will construct an optimization framework that determines a proper kernel similarity matrix, via an optimal convex combination of kernels from a kernel dictionary of available mappings $\mathcal{D} := \{A^1_x, \dots, A^B_x\}$, that leads to a desirable block diagonal kernel covariance structure [see Fig. 1 (upper center)]. The dictionary kernel matrices $A^b_x$ are assumed to be pre-specified, and they can be evaluated from the available data using, e.g., Gaussian, polynomial and other well-established kernel mappings (Ren and Sun, 2020; Ren et al., 2020).

2.1 Spatially distributed sensing agents

The sensing agents are spatially scattered in the observed field. A communication graph G is utilized to model the spatial configuration and connectivity of the units. Each agent corresponds to a graph node in the set $V := \{1, \dots, P\}$, while the graph edges in the set E correspond to active communication links. Connectivity in the graph is summarized by the adjacency matrix A whose (i, j)th entry is equal to 1 if units i and j communicate, and zero otherwise.
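A minimal sketch of this communication model is given below, using the placement rule adopted later in the numerical section (units connected when their distance is below 0.25 in the unit square) to form the adjacency matrix A and each unit's set of single-hop neighbors (denoted $N_j$ below); positions and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
P, radius = 10, 0.25                         # hypothetical network size and communication range
pos = rng.uniform(0.0, 1.0, size=(P, 2))     # agent positions in [0,1] x [0,1]

# Adjacency: A[i, j] = 1 if units i and j can communicate (single hop), else 0.
dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
A = ((dist < radius) & (dist > 0)).astype(int)

# Single-hop neighborhood N_j of each unit j (excluding j itself).
neighborhoods = {j: np.flatnonzero(A[j]) for j in range(P)}
print(neighborhoods)
```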
The adjacency matrix is symmetric, while $N_j$ is the single-hop neighborhood of unit j, i.e., the set of units which communicate directly (single-hop) with unit j. An example with Q = 3 and P = 10 is provided in Fig. 1.

Figure 1: Joint kernel learning and clustering framework.

The communication graph is assumed to be connected; thus there is a communication path (possibly having multiple edges) from one node to any other node.

Connectivity assumption: Further, it is assumed that there is at least one sensing unit j whose single-hop neighborhood $N_j$ consists of units each observing one source, such that the set of neighborhood observations $\{x_{j'}\}_{j' \in N_j}$ covers all Q sources present in the field, i.e., the measurements $\{x_i(t)\}_{i \in N_j}$ in neighborhood $N_j$ cover all source signals $\{s_q(t)\}_{q=1}^{Q}$; we denote this set of units as $S_Q$. An example is provided in Fig. 1 (left) where $S_Q$ contains unit 3 sensing $s_1$; note that unit 3 has single-hop neighboring units 4 and 10 sensing sources $s_2$ and $s_3$, as well as units 1 and 2 sensing source $s_1$. Thus, the set of observations $x_3, x_1, x_2, x_{10}$ and $x_4$ covers all Q = 3 sources present. The latter assumption can be achieved using certain units that have transceivers with a longer communication range, ensuring these units can receive information from a number of neighboring units whose measurements contain information about all Q sources. The objective is to allow each unit to cluster its measurements according to their unknown source content by exchanging information only with its single-hop neighboring units in the set $N_j$. As we will explain in detail later (Sec. 4.3), satisfying this assumption does not require knowing the data clustering configuration.

3. SDP Kernel Learning and Clustering

The idea is to construct, using the available data, a kernel similarity matrix $K_x$ that is block diagonal with rank equal to Q, the number of underlying sources of interest. The diagonal blocks, once identified, will facilitate identifying the data clusters via proper sparse matrix factorization. A block diagonal matrix $K_x$ with Q diagonal blocks of rank 1 can be factorized as $K_x = MN^T$, where the $P \times Q$ matrix factors M, N have at most one nonzero entry across each of their rows [see Fig. 1 (right)]. A pertinent distance measure quantifying how far $K_x$ is from a block diagonal structure is the following (Malhotra and Schizas, 2020)

$$F(K_x) = \|K_x - MN^{T}\|_F^2 + \nu \sum_{i=1}^{P}\big[\|M_{i,:}\|_1 - \|M_{i,:}\|_2\big] + \nu \sum_{i=1}^{P}\big[\|N_{i,:}\|_1 - \|N_{i,:}\|_2\big] + \xi\,\|M - N\|_F^2, \qquad (3)$$

where $\|\cdot\|_1$ and $\|\cdot\|_2$ are the $\ell_1$ and $\ell_2$ norms of a vector, while $\nu, \xi$ are nonnegative regularization parameters controlling the sparsity of the rows of M, N, and the similarity of the factors M and N, respectively. $M_{i,:}$, $N_{i,:}$ refer to the ith rows of factors M and N, respectively. The penalty term $\|M_{i,:}\|_1 - \|M_{i,:}\|_2 \geq 0$ will be zero only when $M_{i,:}$ has one nonzero entry or is equal to an all-zeros vector. The $\ell_1 - \ell_2$ term has proven advantages over many other sparsity metrics (see e.g., Yin et al., 2015). Note that the function $F(K_x) \geq 0$ attains its lowest value $F(K_x) = 0$ if $K_x$ has a block diagonal structure with each diagonal block having rank one, or $K_x$ has rank Q. In these two aforementioned cases $\|M_{i,:}\|_1 = \|M_{i,:}\|_2$ (and $\|N_{i,:}\|_1 = \|N_{i,:}\|_2$), i.e., the ith row $M_{i,:}$ will have at most one nonzero entry. A good similarity matrix $K_x$ should have rank equal to Q > 1, where Q corresponds to the number of diagonal blocks that need to be present in $K_x$, equal to the Q different groups of units, each sensing a different source.
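A direct evaluation of the distance measure in (3) can serve as a sanity check; the sketch below (with placeholder regularization weights) returns zero for an exactly block diagonal Kx whose factors have one nonzero entry per row.

```python
import numpy as np

def block_diag_distance(Kx, M, N, nu=1.0, xi=1.0):
    """Evaluate F(Kx) in (3): factorization error plus l1-l2 row-sparsity
    penalties on M, N and a similarity penalty between the two factors."""
    fit = np.linalg.norm(Kx - M @ N.T, 'fro') ** 2
    l1_l2_M = np.sum(np.linalg.norm(M, 1, axis=1) - np.linalg.norm(M, 2, axis=1))
    l1_l2_N = np.sum(np.linalg.norm(N, 1, axis=1) - np.linalg.norm(N, 2, axis=1))
    sim = np.linalg.norm(M - N, 'fro') ** 2
    return fit + nu * (l1_l2_M + l1_l2_N) + xi * sim

# F vanishes for an exact rank-Q block diagonal Kx with one nonzero per row of M = N.
M = np.zeros((4, 2))
M[[0, 1], 0] = 1.0
M[[2, 3], 1] = 1.0
print(block_diag_distance(M @ M.T, M, M))   # ~0.0
```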
Our idea is to induce rank Q by imposing constraints on the magnitude of the Q largest eigenvalues of Kx. Thus, it is important to ensure that the matrix has Q strong eigenvalues. To this end, we propose maximizing the Qth largest eigenvalue, while introducing a mechanism that penalizes matrices of rank exceeding Q. We resort to semidefinite programming (SDP) and linear matrix inequalities (LMIs) (Boyd and Vandenberghe, 2004) to enforce the aforementioned requirements. Further, the formulation in (3) is not amenable to distributed implementations. SDP will provide an effective formulation that can be further separated in local tasks across the sensing units. The following minimization formulation is DISTRIBUTED KERNEL-DRIVEN DATA CLUSTERING arg min µ w + v PP ℓ=1 [ Mℓ,: 1 Mℓ,: 2] + v PP ℓ=1 [ Nℓ,: 1 Nℓ,: 2] + ω ψ + ξ θ M PM b=1 αb Ab x " θ QIP M N (M N)T 1 QIQ P IP PB b=1 αb Ab x M NT PB b=1 αb Ab x M NT T 1 0.5 (NT M + MT N) w IQ, w 0, {αb 0}B b=1 and P b αb = 1, (4) where corresponds to a matrix inequality, whereas indicates entry-wise inequality. The associated optimization variables are the P Q matrix factors M, N, and the scalar variables α := {αb}B b=1, w, ψ and θ; IQ denotes an identity matrix of size Q Q. The objective of the minimization formulation in (4) is to determine a proper convex combination PB b=1 αb Ab x [see Fig. 1 (center)] of kernel covariances matrices available in a dictionary D := {Ab x}B b=1 to minimize the block diagonal metric in (3). The penalty terms µ, v, ω, ξ > 0 are user-set. Note that the kernel similarity matrix Kx is set as the convex combination, Kx := PB b=1 αb Ab x, of the kernel matrices available in the dictionary D. To satisfy the convexity of the Kx := PB b=1 αb Ab x, the last two constraints in (4), forming a simplex, are employed. The first three matrix inequalities in (4) are equivalent (using the Schur complement; details are provided in Apdx. A) to b=1 αb Ab x M MT , M N 2 F θ, b=1 αb Ab x M NT respectively. Thus, the second and third inequalitiers in (4) combined with the second and third terms in the cost in (4) ensure the minimization of the distance metric in (3). Further, the fourth LMI in (4) combined with the first term in the cost of (4) ensure that both M and N are full-column rank equal to Q. Lastly, the first LMI in (4) is equivalent to PB b=1 αb Ab x MMT which combined with the fourth inequality in (4) ensure that PB b=1 αb Ab x will also be of rank equal to Q (details in Apdx. A). Thus, it is established (Apdx. A) that the formulation in (4) promotes the selection of dictionary kernel covariances that give rise to a block diagonal kernel PB b=1 αb Ab x of rank Q, as long as the dictionary has kernel members supporting that. Proposition 1 The formulation in (4) has an optimal solution involving coefficients {α b} that result a kernel covariance matrix PB b=1 α b Ab x which is block diagonal and has rank Q, as long as the kernel dictionary D := {Ab x}B b=1 contains a convex linear combination of its kernel elements which is block diagonal and has rank Q. Remark: It should be emphasized that the dictionary elements Ab x do not have to be block diagonal and the size of the blocks does not have to be known. Only the number of sources Q is assumed to be known in the present work. The proposed formulation in (4) (and related algorithms later on) aim to construct a convex combination Kx := PB b=1 αb Ab x using the elements of dictionary D to construct a Kx as close to a block diagonal as possible. 
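The constraints above can be prototyped with an off-the-shelf conic solver. The sketch below is a deliberately reduced convex surrogate of (4), not the paper's implementation: the factors M, N are held fixed (the full method alternates over them, as described in Sec. 3.1), only the norm-form equivalent of the third LMI and the simplex constraint are kept, and cvxpy, the weights and the toy dictionary are assumptions made for illustration.

```python
import cvxpy as cp
import numpy as np

def select_kernel_weights(A_list, M, N, omega=1.0):
    """Choose simplex weights alpha so that sum_b alpha_b * A^b_x is close to M N^T.
    This is the norm-form equivalent (via the Schur complement) of the third LMI in (4),
    with the factors M, N held fixed; the remaining LMIs and l1-l2 penalties are
    handled by the full formulation."""
    B = len(A_list)
    alpha = cp.Variable(B, nonneg=True)      # kernel coefficients on the simplex
    psi = cp.Variable(nonneg=True)           # bound on the fitting error

    K = sum(alpha[b] * A_list[b] for b in range(B))
    constraints = [cp.sum_squares(K - M @ N.T) <= psi,   # ||sum_b alpha_b A^b_x - M N^T||_F^2 <= psi
                   cp.sum(alpha) == 1]                   # convex combination of dictionary kernels
    cp.Problem(cp.Minimize(omega * psi), constraints).solve()
    return alpha.value

# Toy usage: the two block kernels receive all the weight, the identity none.
blockA = np.kron(np.diag([1, 0]), np.ones((2, 2)))
blockB = np.kron(np.diag([0, 1]), np.ones((2, 2)))
M = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
print(np.round(select_kernel_weights([blockA, blockB, np.eye(4)], M, M), 2))  # approx [0.5 0.5 0.]
```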
3.1 Block Coordinate Descent and Difference of Convex Form The formulation in (4) is nonconvex hindering the utilization of efficient optimization techniques. The main reasons are: i) the cost function contains concave terms, namely Mℓ,: 2 and Nℓ,: 2 which give rise to a difference of convex functions cost; and ii) several of the matrix inequality constraints contain the nonlinear terms M NT or NT M resulting nonconvex matrix inequality constraints. To work around these challenges we resort i) to a difference of convex formulation; and ii) a block coordinate descent framework where we optimize with respect to M, α, w, ψ, θ while keeping fixed the factor N to its more recent update during iteration τ, namely Nτ. In detail, when updating M, α the nonlinear terms NT M and M NT will be replaced with Nτ MT and M NT τ using the most recent update Nτ; while when updating N the nonlinear terms M NT and NT M will be replaced with Mτ+1 NT and NT Mτ+1 using update Mτ+1. Notice that the cost in (4) contains the terms PP ℓ=1 [ Mℓ,: 1 Mℓ,: 2] and PP ℓ=1 [ Nℓ,: 1 Nℓ,: 2]. If we set H(M) = PP ℓ=1 Mℓ,: 1 and G(M) = PP ℓ=1 Mℓ,: 2 then the second and third terms in (4) is essentially the difference of two convex functions, i.e, H(M) G(M). To this end, we will resort to the difference of convex functions approach (Tao and An, 1997). The algorithm iteratively computes an affine majorization of the function G(M), where the majorization during the τth iterate is given as, trace MT , n G(M) M |M=Mτ o . The cost in (4) can be numerically minimized by utilizing block coordinate descent (Tseng, 2001) along with the difference of convex functions recursive approach. During iteration τ + 1 the proposed algorithm involves the following steps Step 1a: Fix factor N to most recent update Nτ and minimize (4) with respect to (wrt) the rest of the variables. The gradient of Mℓ,: 2 is evaluated at Mτ,κ which is obtained during difference of convex algorithm (DCA) iteration κ mτ,κ ℓ Mℓ,: 2 M=Mτ,κ = Mτ,κ,ℓ,: 2 1 MT τ,κ,ℓ,:, (5) where MT τ,κ,ℓ,: is the ℓth row of update Mτ,κ. Step 1b: Solve the majorized version of (4) after replacing N = Nτ, and using the gradient in (5) to linearize the concave terms Mℓ,: 2, i.e. we obtain the SDP, {Mτ,κ+1, ατ,κ+1, wτ,κ+1, ψτ,κ+1, θτ,κ+1} arg min µ w + ω ψ + ξ θ + v PP ℓ=1 h Mℓ,: 1 Mτ,κ,ℓ,: Mτ,κ,ℓ,: 2 MT ℓ,: i s. to IQ MT M PB b=1 αb Ab x " θ QIP M Nτ (M Nτ)T 1 QIQ P IP PB b=1 αb Ab x M NT τ PB b=1 αb Ab x M NT τ T 1 0.5 (NT τ M + MT Nτ) w IQ, w 0, {αb 0}B b=1 and P b αb = 1. (6) Steps 1a and 1b are repeated until Mτ,κ+1, ατ,κ+1 converge (as will be established in Prop. 3), which algorithmically will be verified by checking when the update amounts Mτ,κ+1 Mτ,κ F DISTRIBUTED KERNEL-DRIVEN DATA CLUSTERING and ατ,κ+1 ατ,κ F drop below a user-set threshold ϵ1. We denote the converging entities as Mτ+1 and ατ+1 which will be fixed next in (4) to update the factor N using the following two steps. Step 2a: Fix factor M and α to the most recent update Mτ+1 and ατ+1 respectively, and minimize (4) with respect to the rest of the variables. To this end, we calculate the gradient of Nℓ,: 2 evaluated at Nτ,κ that is obtained during DCA iteration κ = 0, 1, 2, . . . , nτ,κ ℓ Nℓ,: 2 N=Nτ,κ = Nτ,κ,ℓ,: 2 1 NT τ,κ,ℓ,:. (7) Step 2b: Solve the majorized version of (4) after replacing M = Mτ+1 and α = ατ+1, and using the gradient in (7) to linearize the concave terms Nℓ,: 2 for ℓ= 1, . . . 
, P, i.e., {Nτ,κ+1, wτ,κ+1, ψτ,κ+1, θτ,κ+1} arg min µ w + ω ψ + ξ θ + v PP ℓ=1 h Nℓ,: 1 Nτ,κ,ℓ,: Nτ,κ,ℓ,: 2 NT ℓ,: i (8) " θ QIP Mτ+1 N (Mτ+1 N)T 1 QIQ P IP PB b=1 ατ+1,b Ab x Mτ+1 NT PB b=1 ατ+1,b Ab x Mτ+1 NT T 1 0.5 (NT Mτ+1 + MT τ+1 N) w IQ, w 0. Steps 2a and 2b are repeated until Nτ,κ+1 converges (as will be established in Prop. 3), which algorithmically will be verified by checking when the update amount Nτ,κ+1 Nτ,κ F drops below a user-set threshold ϵ1. We denote the converging factor as Nτ+1. The recursive method for numerically solving (4) is tabulated in detail as Alg. 1. Notice that a similar breaking condition is also utilized to terminate the outer block coordinate descent loop (with iteration index τ) using the user-defined threshold ϵ2. Kmax denotes the user-defined maximum number of DCA iterations employed during coordinate descent iteration τ, in case it takes too long for the breaking conditions in lines 7 or 15 of Alg. 1 to be satisfied. We set this value equal to Kmax = 20 for the numerical tests conducted in the paper ensuring convergence. 3.2 Convergence Next we demonstrate that Alg. 1 returns updates {Mτ, ατ, wτ, ψτ, θτ} that result a non-increasing sequence of cost values in (4). First, we establish in Apdx. B that Lemma 2 The cost functions in (4), (6) and (8) are bounded below by a finite negative number. A negative value of the cost in (4), (6) and (8) implies that factor Mτ has full rank Q. Using Lemma 2 it is further established in Apdx. C that Proposition 3 The iterates {Mτ, Nτ, ατ, wτ, ψτ, θτ} produced by Alg. 1 result a non-increasing cost function value sequence Jτ( ) in (4) which converges to a finite value as τ , i.e., lim τ Jτ(Mτ, ατ, wτ, ψτ) = J < . (9) Therefore using continuity of the cost in (4) the iterates {Mτ, Nτ, ατ, wτ, ψτ, θτ} converge too. Algorithm 1 Centralized Joint Kernel Selection and Clustering (CKC) 1: Initialize N0,0 randomly, and kernel weights α0,0,j = 1 B j 1, ..., B. Set penalty coefficients ω, ξ, µ, v. 2: The dictionary kernel matrices {Ab x}B b=1 are normalized to have unit trace. 3: for τ = 0, 1, 2... do 4: for κ = 0, 1, 2...Kmax 1 do 5: Form the gradients {mτ,κ ℓ }P ℓ=1 using (5). 6: Solve SDP formulation in (6) to obtain updates {Mτ,κ+1, ατ,κ+1, wτ,κ+1, ψτ,κ+1, θτ,κ+1}. This can be done using interior point methods, (e.g., Boyd and Vandenberghe, 2004; Grant and Boyd, 2014). 7: if ( Mτ,κ+1 Mτ,κ F + ατ,κ+1 ατ,κ F < ϵ1) then 8: Mτ+1 = Mτ,κ+1 and ατ = ατ,κ+1. 9: Break. 10: end if 11: end for 12: for κ = 0, 1, 2...Kmax 1 do 13: Form the gradients {nτ,κ ℓ }P ℓ=1 using (7). 14: Solve SDP formulation in (8) to obtain updates {Nτ,κ+1, wτ,κ+1, ψτ,κ+1, θτ,κ+1}. This can be done using interior point methods. 15: if ( Nτ,κ+1 Nτ,κ F < ϵ1) then 16: Nτ+1 = Nτ,κ+1. 17: Break. 18: end if 19: end for 20: if Mτ+1 Mτ F + Nτ+1 Nτ F + ατ+1 ατ F < ϵ2 then 21: Break. 22: end if 23: end for Remarks: Lemma 2 and Prop. 3 demonstrate that if the limit value J, to which iterates Jτ( ) converge to, is negative then the iterates Mτ will be full rank Q. For sufficiently large µ the updates ωτ,κ for sufficiently large τ should be strictly positive which combined with the LMI 0.5 (MT N + NT M) w IQ will return factors Mτ, Nτ that have rank equal to Q (see also proof in Apdx. B). 
Further, the iterates ψτ,κ are pushed towards zero as much as possible, and since P b ατ,κ+1,b Ab x Mτ,κ+1NT τ 2 F ψτ,κ+1 that further implies that the iterates ατ,κ+1,b are selected such that P b ατ,κ+1,b Ab x get as close as possible to Mτ,κ+1NT τ that has rank Q, while from the first LMI in (4) P b ατ,κ+1,b Ab x has rank at least equal to Q. Finally, the terms Mτ,κ+1,ℓ,: 1 Mτ,κ,ℓ,: Mτ,κ,ℓ,: 2 (Mτ,κ+1,ℓ,:)T in (6) can only attain the lowest value of zero only if the ℓth row Mτ,κ+1,ℓ,: contains at most one nonzero entry. Similar arguments can be applied for row updates Nτ,κ+1,ℓ,:. Thus, Mτ+1, Nτ+1 factor iterates are pushed towards a block diagonal structure. These convergence claims hold as long as Kmax is sufficiently large to ensure that the breaking conditions on lines 7 and/or 15 in Alg.1 are met. Although, an analytical lower bound could not be obtained for Kmax, a value of 20 meets the requirements for the numerical tests considered in this work. DISTRIBUTED KERNEL-DRIVEN DATA CLUSTERING 4. Distributed Kernel-Based Clustering The SDP formulations in (6) and (8) require storage of the dictionary kernels Ab x and optimization variables in one central location. Numerically solving (6) or (8) via interior point methods, (e.g., Boyd and Vandenberghe, 2004; Grant and Boyd, 2014), has a complexity of O(P 4) where P is the number of agents. In the network setting considered in Sec. 2 each agent j acquires a measurement xj(t) and can directly exchange information with its single-hop neighbors. The objective is to derive from (6) and (8) a separable minimization formulations that can be solved in a distributed fashion by solving small scale local SDP problems across the sensing agents and exchanging information with single-hop neighbors. Such a separable formulation will enable the implementation of joint kernel learning and clustering in ad hoc distributed settings. To this end, we introduce local optimization variables that will help reformulate (6) and (8) in a separable fashion. At sensing unit j, let αj := {αj b}B b=1 denote the local version of the kernel coefficient variables {αb}B b=1. Similarly ψj, wj and θj are local replicas of the variables ψ, w and θ, respectively. Further, we introduce the local |Nj| Q factor matrix MNj := h (Mj j,:)T , (Mj j ,:)T , . . . , (Mj j ,:)T i T , (10) where the indices j, j , j belong to Nj. Thus, MNj contains a local (at unit j) version Mj j ,: of the row Mj ,: in central P Q factor M in (6) for j Nj. Similarly, we can define a local version NNj for factor N stored at agent j. Unit j can exchange information directly with other units j Nj in its single-hop neighborhood, therefore it can directly calculate the entries of dictionary kernels Ab x(j , j ) for j , j j Nj which form a (|Nj| + 1) (|Nj| + 1) submatrix of Ab x denoted [Ab x]Nj [see Fig. 2]. All units in the distributed setting utilize the same pre-determined dictionary D = {Ab x}B b=1. This can be hardwired in the local sensing units. The only task that local units need to perform with regard to the kernel elements in D is to extract the local submatrices [Ab x]Nj, which can be formed directly after keeping the rows and columns of Ab x with indices in neighborhood Nj. Similarly, at unit j only the entries with row and column indices from {j} Nj can be evaluated directly via single-hop communications in H1 τ := M NT τ and H2 τ := 0.5 (NT τ M + MT Nτ) on the left hand side (lhs) of the third and fourth LMIs in (6), respectively. 
We denote this (|Nj| + 1) (|Nj| + 1) local submatrices of H1 τ and H2 τ, which are calculated locally at unit j as [H1 τ(Nτ)]Nj = [M (Nτ)T ]Nj (11) [H2 τ(Nτ)]Nj = 0.5 P j Nj {j}[(Nj τ,j ,:)T Mj j ,: + (Mj j ,:)T Nj τ,j ,:], where Nj τ,j ,: a local version of row j of factor update Nτ contained in [Nτ]Nj at unit j. 4.1 Separable SDP Formulation The first two LMI constraints in (6) involve the central factor variables M, Nτ, as well as all the entries of dictionary kernels {Ab x} which are not available in a single location in the distributed setting considered here. We substitute these two LMI constraints with the following set of local G1,j := IQ (MNj)T MNj PB b=1 αj b[Ab x]Nj Gτ,2,j(Nτ) := " θj QI|Nj|+1 [M Nτ]Nj ([M Nτ]Nj)T 1 QIQ for j = 1, . . . , P which utilizes locally available entities MNj, [Nτ]Nj and [Ab x]Nj and using the Schur complement ensures PB b=1 αj b[Ab x]Nj MNj MT Nj and [M Nτ]Nj 2 F θj. Using similar reasoning the third LMI in (6) is replaced with the following set of local LMI constraints for j = 1, . . . , P Gτ,3,j(Nτ) := |Nj|+1I|Nj|+1, b=1 αj b[Ab x]Nj [H1 τ(Nτ)]Nj B X b=1 αj b[Ab x]Nj [H1 τ(Nτ)]Nj |Nj|+1I|Nj|+1 which via the Schur complement is equivalent to PB b=1 αj b[Ab x]Nj [H1 τ(Nτ)]Nj 2 F ψj. Similarly the fourth LMI is replaced with [H2 τ(Nτ)]Nj wj for j SQ with SQ denoting the set of those units j whose neighbors in Nj sense all Q sources. The entry-wise constraint (in)-equalities are substituted with the local versions wj 0, {αj b 0}, P b αj b = 1 for j = 1, . . . , P. Starting from the cost function in (6) we replace the global variables w, ψ and θ with the the average of their local versions, i.e., |SQ| 1 P j SQ wj, P 1 PP j=1 ψj and P 1 PP j=1 θj, respectively, with |SQ| denoting the cardinality of set SQ. In the summation term of cost (6) factors Mℓ,: are replaced with their local representation Mℓ ℓ,:, while Mτ,κ,ℓ,: are replaced with local update Mℓ τ,κ,ℓ,:. We also introduce constraints to ensure equality among the local versions of the variables representing the coefficients {αb}B b=1, among the local versions of the factor rows {Mj,:}P j=1 and {Nj,:}P j=1, respectively. Auxiliary variables βj,j := {βj,j b }B b=1, and row vectors Zj,j , Θj,j are introduced in the following local equality constraints: {αj b = βj,j b , αj b = βj,j b }B b=1, for j = 1, . . . , P, j Nj (15) Mj j,: = Zj,j , Mj j,: = Zj,j , for j Nj, j = 1, . . . , P, (16) Nj j,: = Zj,j , Nj j,: = Zj,j , for j Nj, j = 1, . . . , P. (17) The constraints in (15) result that αj b = αj b for j Nj and j = 1, . . . , P; since the communication graph of the sensing units is connected this further implies that α1 b = α2 b = . . . = αP b for b = 1, . . . , P. Thus, all the local coefficient variables will be identical since they all represent the centralized set of coefficients αb. Similarly, the equalities in (16) and (17) ensure that Mj j,: = Mj j,: and Nj j,: = Nj j,: for j Nj and j = 1, . . . , P which combined with the communication graph connectivity will ensure that all local row vectors Mj j,: will be identical for j Nj since they rep- resent the centralized row variable Mj,: for j = 1, . . . , P (similarly all local Nj j,: will be equal for j Nj {j} ). 
The variables βj,j b , Zj,j and Zj,j are utilized to facilitate distributed minimization DISTRIBUTED KERNEL-DRIVEN DATA CLUSTERING of (6) and (8) via the alternating direction method of multipliers and eventually there will be no need to update them separately in the recursive updates obtained, since they will be written as linear functions of other optimization variables been updated. The following separable formulation of (6) is obtained arg min µ|SQ| 1 P j SQ wj + ωP 1 P j ψj (18) + ξP 1 P j θj + v PP ℓ=1 Mℓ ℓ,: 1 Mℓ τ,κ,ℓ,: Mℓ τ,κ,ℓ,: 2 (Mℓ ℓ,:)T s. to G1,j 0, Gτ,2,j(Nτ) 0, Gτ,3,j(Nτ) 0, {[H2 τ(Nτ)]Nj wj IQ}j SQ, wj 0, {αj b 0}B b=1 P b αj b = 1, for j = 1, . . . , P, {αj b = βj,j b }B b=1, {αj b = βj,j b }B b=1 Mj j,: = Zj,j , Mj j,: = Zj,j , for j Nj, j = 1, . . . , P. Using similar steps, the SDP program in (8) is reformulated in the following separable formulation arg min µ|SQ| 1 P j SQ wj + ωP 1 P j ψj (19) + ξP 1 P j θj + v PP ℓ=1 Nℓ ℓ,: 1 Nℓ τ,κ,ℓ,: Nℓ τ,κ,ℓ,: 2 (Nℓ ℓ,:)T s. to Gτ,2,j(Mτ+1) 0, Gτ,3,j(Mτ+1) 0, {[H2 τ(Mτ+1)]Nj wj IQ}j SQ, wj 0, Nj j,: = Θj,j , Nj j,: = Θj,j , for j Nj, j = 1, . . . , P, where Gτ,2,j(Mτ), Gτ,3,j(Mτ), [H2 τ(Mτ)]Nj have the same structure given in (11),(12) and (14) respectively, after replacing M with the most recent update Mτ+1 and Nτ with N which is the main optimization variable in (8). 4.2 Separable vs. Centralized Formulation The formulations in (18) and (19) are separable versions of (6) and (8) that will enable distributed minimization of the corresponding cost functions within a connected network of sensing units. These separable formulations will be an approximation of (6) and (8) due to the utilization of the local LMI constraints. From the equality constraints in the last two lines of (18) and (19) notice that Mj j,: = Mj,:, Nj j,: = Nj,: and {αj b = αb}B b=1 for all j Nj {j} and j = 1, . . . , P. The next result proved in Apdx. D delineates the relationship between the separable formulations in (18) and (19), and the centralized formulation in (6) and (8). Proposition 4 The set of local LMIs [H2 τ(Nτ)]Nj wj IQ and [H2 τ(Mτ+1)]Nj wj IQ for j SQ in (18) and (19) respectively, guarantee that their minimizers M , N have rank equal to Q . Further, the local LMI constraints G1,j 0 for j = 1, . . . , P in (18) ensure that the feasible set of (18) contains a minimizing factor M and kernel coefficients {α b}B b=1 such that P b α b Ab x has rank at least Q. If the kernel dictionary D contains a unique subset of kernels whose convex combination is block diagonal with Q diagonal blocks of rank 1, then factors M , N and coefficients {α b}B b=1 exist that are minimizers of both (6),(8) and (18)-(19) for sufficiently large ω and v parameters. Prop. 4 implies that although the formulations in (6) [ or (8)] and (18) [or (19)] are not equivalent, the minimizing M , N and {α b}B b=1 will be selected such that rank(M ) = rank(N ) = Q and rank(P b α b Ab x) Q. These are the exact same rank requirement imposed by the first and fourth LMI constraints in the centralized formulation in (4). 4.3 Distributed Minimization via Alternating Direction Method of Multipliers (ADMM) ADMM (Bertsekas and Tsitsiklis, 2015; Boyd et al., 2011) is utilized to solve the minimization problems in (18) and (19) in a distributed fashion that allows communication only between singlehop neighboring units. 
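As an aside, the consensus mechanism that ADMM provides here can be illustrated on a scalar toy problem: each agent holds a local copy of a shared quantity (standing in for the kernel coefficients), auxiliary variables and multipliers enforce agreement with single-hop neighbors, and after the simplifications derived below only neighbor averages and disagreements need to be exchanged. The network, local costs, and step size in this sketch are made up for illustration.

```python
import numpy as np

# Toy consensus ADMM: each agent j minimizes a local quadratic f_j(a) = 0.5*(a - y_j)^2
# subject to agreement with its single-hop neighbors, mirroring steps D1a-D1c.
rng = np.random.default_rng(2)
P = 6
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
y = rng.normal(size=P)                 # local data (hypothetical)
c = 1.0                                # ADMM step size
a = y.copy()                           # local copies of the shared variable
lam = {(j, k): 0.0 for j in range(P) for k in neighbors[j]}   # multipliers

for _ in range(200):
    a_prev = a.copy()
    # Auxiliary variables from the latest neighbor exchanges, cf. (31).
    z = {(j, k): 0.5 * (a_prev[j] + a_prev[k]) for j in range(P) for k in neighbors[j]}
    # Local primal updates (closed form here; a local SDP like (35) in the actual scheme).
    for j in range(P):
        Nj = neighbors[j]
        num = y[j] - sum(lam[(j, k)] for k in Nj) + c * sum(z[(j, k)] for k in Nj)
        a[j] = num / (1.0 + c * len(Nj))
    # Multiplier (dual ascent) updates, cf. (29).
    for j in range(P):
        for k in neighbors[j]:
            lam[(j, k)] += 0.5 * c * (a[j] - a[k])

print(a)                               # all entries approach the same consensus value (~y.mean())
```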
The augmented Lagrangian associated with (18) is Lτ,κ({MNj, αj, wj, ψj}P j=1, ζ, ζ, ξ, ξ) := µ |SQ| 1 X j SQ wj + P 1 X j [ωψj + ξθj] (20) Mℓ ℓ,: 1 Mℓ τ,κ,ℓ,: Mℓ τ,κ,ℓ,: 2 (Mℓ ℓ,:)T # j Nj ζT j,j [Mj j,: Zj,j ]T + ζT j ,j[Mj j,: j Nj ξT j,j [αj βj,j ] + ξT j ,j[αj βj,j ] h Mj j,: Zj,j 2 2 + Mj j,: Zj,j 2 2 i , h αj βj,j 2 2 + αj βj,j 2 2 i where αj = [αj 1, . . . , αj B], βj,j = [βj,j 1 , . . . , βj,j B ], while ζ := {ζj,j }, ζ := { ζj,j }, ξ := {ξj,j }, ξ := { ξj,j } contain the Q 1 Lagrange multiplier vectors ζj,j and ζj ,j associated with the constraints Mj j,: = Zj,j and Mj j,: = Zj,j , respectively, whereas the B 1 multiplier vectors ξj,j and ξj ,j are associated with the equality constraints αj = βj,j and αj = βj,j , respectively. Constant c is a nonnegative coefficient imposing strict convexity and it will be acting as a step-size for updating the multipliers. ADMM updates the variables and multipliers in (20) during alternating iteration ρ + 1 through the following step Step D1a: Unit j obtains local iterates {Mτ,κ Nj (ρ+1), αj τ,κ(ρ+1), wτ,κ j (ρ+1), ψτ,κ j (ρ+1), θτ,κ j (ρ+ 1)} by solving the following local minimization problem that stems from (20) and the constraints DISTRIBUTED KERNEL-DRIVEN DATA CLUSTERING Figure 2: Localized kernel learning and clustering tasks. in (18), after keeping the jth unit local terms/variables arg min MNj ,αj,wj,ψj,θj µ1j SQ|SQ| 1wj + ωP 1ψj + ξP 1θj + Mj j,: 1 Mj τ,κ,j,: Mj τ,κ,j,: 2 (Mj j,:)T # j Nj (ζτ,κ j,j )T (ρ)[Mj j,: Zτ,κ j,j (ρ)]T + X j Nj ( ζτ,κ j,j )T (ρ)[Mj j ,: Zτ,κ j ,j(ρ)]T j Nj (ξτ,κ j,j )T (ρ)[αj βτ,κ j,j (ρ)] + X j Nj ( ξτ,κ j,j )T (ρ)[αj βτ,κ j ,j(ρ)] h Mj j,: Zτ,κ j,j (ρ) 2 2 + Mj j ,: Zτ,κ j ,j(ρ) 2 2 i h αj βτ,κ j,j (ρ) 2 2 + αj βτ,κ j ,j(ρ) 2 2 i , s. to G1,j 0, Gτ,2,j(Nτ) 0, Gτ,3,j(Nτ) 0, 1j SQ [H2 τ(Nτ)]Nj wj IQ 0, wj 0, {αj b 0}B b=1, P b αj b = 1, (21) where 1j SQ is an indicator function equal to 1 if j SQ and zero otherwise. Step D1b: Unit j updates the auxiliary variables Zτ,κ j,j (ρ + 1), βτ,κ j,j (ρ + 1) for j Nj via the quadratic program {Zτ,κ j,j (ρ + 1), βτ,κ j,j (ρ + 1)} = arg min Zj,j ,βj,j (ζτ,κ j,j )T (ρ) Zj,j ( ζτ,κ j ,j)T (ρ) Zj,j (ξτ,κ j,j )T (ρ) βj,j ξT j ,j(ρ)βj,j (22) + 0.5 c h Mj τ,κ,j,:(ρ + 1) Zj,j 2 2 + Mj τ,κ,j,:(ρ + 1) Zj,j 2 2 i + 0.5 c h αj τ,κ(ρ + 1) βj,j 2 2 + αj τ,κ(ρ + 1) βj,j 2 2 i , which after applying first-order optimality conditions results: Zτ,κ j,j (ρ + 1) = 0.5 c 1[ζτ,κ j,j (ρ) + ζτ,κ j ,j(ρ)] + 0.5 [Mj τ,κ,j,:(ρ + 1) + Mj τ,κ,j,:(ρ + 1)], (23) βτ,κ j,j (ρ + 1) = 0.5 c 1[ξτ,κ j,j (ρ) + ξτ,κ j ,j(ρ)] + 0.5 [αj τ,κ(ρ + 1) + αj τ,κ(ρ + 1)]. (24) Step D1c: Sensing unit j = 1, . . . , P updates the Lagrange multipliers ζτ,κ j,j (ρ + 1), ζτ,κ j,j (ρ + 1), ξτ,κ j,j (ρ + 1), ξτ,κ j,j (ρ + 1) for j Nj using the gradient ascent iterations: ζτ,κ j,j (ρ + 1)=ζτ,κ j,j (ρ)+c [Mj τ,κ,j,:(ρ + 1) Zτ,κ j,j (ρ + 1)], (25) ζτ,κ j,j (ρ + 1)= ζτ,κ j,j (ρ)+c [Mj τ,κ,j ,:(ρ + 1) Zτ,κ j ,j(ρ + 1)], (26) ξτ,κ j,j (ρ + 1)=ξτ,κ j,j (ρ)+c [αj τ,κ(ρ + 1) βτ,κ j,j (ρ + 1)], (27) ξτ,κ j,j (ρ + 1)= ξτ,κ j,j (ρ)+c [αj τ,κ(ρ + 1) βτ,κ j ,j(ρ + 1)]. (28) Substituting (23) into (25) and (26), and (24) into (27) and (28), it follows that if the Lagrange multipliers are initialized such that ζτ,κ j,j (0) = ζτ,κ j ,j(0) and ξτ,κ j,j (0) = ξτ,κ j ,j(0) then ζτ,κ j,j (ρ) = ζτ,κ j ,j(ρ) and ξτ,κ j,j (ρ) = ξτ,κ j ,j(ρ) for all τ, κ and ρ indices. 
This implies that there is no need to keep track of the multipliers ζj ,j and ξj ,j as long as the multipliers ζj,j and ξj,j are updated using the recursions ζτ,κ j,j (ρ + 1) = ζτ,κ j,j (ρ) + 0.5 c [Mj τ,κ,j,:(ρ + 1) Mj τ,κ,j,:(ρ + 1)], (29) ξτ,κ j,j (ρ + 1) = ξτ,κ j,j (ρ) + 0.5 c [αj τ,κ(ρ + 1) αj τ,κ(ρ + 1)], (30) which can be obtained after employing (23) and (24) in (25) and (27), respectively. Further, eqs. (23) and (24) are simplified Zτ,κ j,j (ρ + 1) = 0.5 [Mj τ,κ,j,:(ρ + 1) + Mj τ,κ,j,:(ρ + 1)] (31) βτ,κ j,j (ρ + 1) = 0.5 [αj τ,κ(ρ + 1) + αj τ,κ(ρ + 1)], (32) and replace Zτ,κ j,j (ρ + 1) and βτ,κ j,j (ρ + 1) in (21) with expressions in (31) and (32), respectively. Thus, Zτ,κ j,j , βτ,κ j,j do not have to be updated as separate variables. The formulation in (21) can be transformed in a SDP formulation after introducing local variables δ1 j,j , δ2 j,j , and δ3 j,j for j Nj, that will replace the last four quadratic terms in the cost of (21) by introducing the inequality constraints Mj j,: Zτ,κ j,j (ρ) 2 2 δ1 j,j , Mj j ,: Zτ,κ j ,j(ρ) 2 2 δ2 j,j αj βτ,κ j,j (ρ) 2 2 = αj βτ,κ j ,j(ρ) 2 2 δ3 j,j , (33) DISTRIBUTED KERNEL-DRIVEN DATA CLUSTERING which can be rewritten using the following LMIs " δ1 j,j IQ (Mj j,: Zτ,κ j,j (ρ))T Mj j,: Zτ,κ j,j (ρ) 1 " δ2 j,j IQ (Mj j ,: Zτ,κ j ,j(ρ))T Mj j ,: Zτ,κ j ,j(ρ) 1 " δ3 j,j IQ (αj βτ,κ j,j (ρ))T αj βτ,κ j,j (ρ) 1 Thus, the minimization formulation (21) in Step D1 can be rewritten as an SDP in the following way: arg min MNj ,αj,wj,ψj,θj µ1j SQ|SQ| 1wj + ωP 1ψj + ξP 1θj + Mj j,: 1 Mj τ,κ,j,: Mj τ,κ,j,: 2 (Mj j,:)T # j Nj (ζτ,κ j,j )T (ρ)[Mj j,: Zτ,κ j,j (ρ)] X j Nj (ζτ,κ j ,j)T (ρ)[Mj j ,: Zτ,κ j ,j(ρ)] j Nj (ξτ,κ j,j )T (ρ)[αj βτ,κ j,j (ρ)] X j Nj (ξτ,κ j ,j)T (ρ)[αj βτ,κ j ,j(ρ)] δ1 j,j + δ2 j,j + 2 δ3 j,j subject to G1,j 0, Gτ,2,j(Nτ) 0, Gτ,3,j(Nτ) 0, 1j SQ [H2 τ(Nτ)]Nj wj IQ 0, 1 j,j 0, 2 j,j 0, 3 j,j 0, j Nj, wj 0, {αj b 0}B b=1, P b αj b = 1, (35) where the equation ζτ,κ j,j (ρ) = ζτ,κ j ,j(ρ) and ξτ,κ j,j (ρ) = ξτ,κ j ,j(ρ) for all τ, κ and ρ has been employed, while (31) and (32) can be employed to replace Zτ,κ j,j (ρ) and βτ,κ j,j (ρ). Essentially the distributed steps D1a-D1c facilitate solving the SDP cost (6) involved in Step 1b of the centralized algorithm in Sec. 3.1, i.e., provide a distributed implementation of line 6 in Alg. 1. Steps D1a-D1c essentially consist a third recursive layer nested within that τ, and κ iterations in Alg. 1. A similar process can be obtained when tackling (8) via ADMM that involves the following steps Step D2a: Sensing unit j = 1, . . . , P obtains local iterates {Nτ,κ Nj (ρ+1), wτ,κ j (ρ+1), ψτ,κ j (ρ+ 1), θτ,κ j (ρ + 1)} by solving arg min NNj ,wj,ψj,θj µ1j SQ|SQ| 1wj + ωP 1ψj + ξP 1θj + Nj j,: 1 Nj τ,κ,j,: Nj τ,κ,j,: 2 (Nj j,:)T # j Nj ( ζ τ,κ j,j )T (ρ)[Nj j,: Zτ,κ j,j (ρ)] X j Nj ( ζ τ,κ j ,j)T (ρ)[Nj j ,: Zτ,κ j ,j(ρ)] + 0.5 c X h δ1 j,j + δ2 j,j i s. to Gτ,2,j(Mτ+1) 0, Gτ,3,j(Mτ+1) 0, 1j SQ [H2 τ(Mτ+1)]Nj wj IQ 0, 1 j,j 0, 2 j,j 0, j Nj, wj 0, Algorithm 2 Distributed Kernel Selection and Clustering (DKC) 1: Each unit j = 1, . . . , P initializes the multipliers ζτ,κ j,j (0), and ξτ,κ j,j (0), e.g., set them to all-zero vectors . 2: for ρ = 0, 1, 2..., ρt do 3: Unit j = 1, . . . , P transmits multipliers ζτ,κ j,j (ρ), ξτ,κ j,j (ρ), row factors Mj τ,κ,j ,:(ρ+1), Mj τ,κ,j,:(ρ+1) and kernel coefficients αj τ,κ(ρ + 1) to its neighbors j Nj. 4: Unit j = 1, . . . 
, P receives from neighbors j Nj the factors Mj τ,κ,j,:(ρ + 1), Mj τ,κ,j ,:(ρ + 1), the kernel coefficients αj τ,κ(ρ + 1) and the Lagrange multipliers ζτ,κ j ,j(ρ) and ξτ,κ j ,j(ρ). 5: Update the auxiliary variables Zτ,κ j,j (ρ), Zτ,κ j ,j(ρ) and βτ,κ j,j (ρ) using (31) and (32). 6: Unit j = 1, . . . , P fsolves the local SDP program in (35) to obtain the updates Mτ,κ Nj (ρ+1), αj τ,κ(ρ+ 1). This can be done locally at unit j using interior point methods. 7: Unit j updates the local Lagrange multipliers ζτ,κ j,j (ρ + 1) and βτ,κ j,j (ρ + 1) for j Nj using (29) and (30). 8: end for 9: Form Mτ,κ+1 using the locally factor rows, i.e., Mτ,κ+1 = {Mj τ,κ,j,:(ρt + 1)}P j=1, and ατ,κ+1 = αj τ,κ(ρt + 1) for any j. where local variables and multipliers ζ τ,κ j,j (ρ), ξ τ,κ j,j (ρ), Zτ,κ j,j (ρ), β τ,κ j,j (ρ), δ1 j,j , δ2 j,j , 1 j,j , 2 j,j , are defined similarly to the corresponding variables without the notation in (35), and replacing M with N, δ with δ, with and Z with Z in (33)-(34). Step D2b: Sensing unit j = 1, . . . , P updates the auxiliary variables, i.e., Zτ,κ j,j (ρ + 1) for j Nj via [similar process as in (23)] Zτ,κ j,j (ρ + 1) = 0.5 [Nj τ,κ,j,:(ρ + 1) + Nj τ,κ,j,:(ρ + 1)], (37) and can be used in place of Zτ,κ j,j (ρ+1) in (36). Thus, Zτ,κ j,j does not have to be updated as a separate variable. Step D2c: Sensing unit j = 1, . . . , P updates the Lagrange multipliers ζ τ,κ j,j (ρ + 1) using the gradient ascent iterations: ζ τ,κ j,j (ρ + 1) = ζ τ,κ j,j (ρ) + 0.5 c [Nj τ,κ,j,:(ρ + 1) Nj τ,κ,j,:(ρ + 1)], (38) Alg. 2 tabulates in detail the steps involved in implementing D1a-D1c (similarly for D2a-D2c); see also Fig. 2. Alg. 2 can also be used to summarize steps D2a-D2c after replacing i) ζτ,κ j,j with ζ τ,κ j,j ; ii) Mj τ,κ,j,:(ρ+1) with Nj τ,κ,j,:(ρ+1); iii) Zτ,κ j ,j(ρ) with Zτ,κ j ,j(ρ); iv) remove αj,τ,κ, ξτ,κ j ,j, and βτ,κ j,j ; and replace (35) with (36). Further, replace eqs. (31) and (32) with (37), and eqs. (29) and (30) with (38). Note that ρt denotes the number of ADMM iterations applied for every τ and κ iterations. Communication Costs: During steps D1a-D1c unit j has to receive i) |Nj| Lagrange multiplier vectors {ζτ,κ j ,j(ρ)}j Nj, 2 |Nj| row factors {Mj τ,κ,j,:(ρ + 1)}j Nj, {Mj ,τ,κ j ,: (ρ + 1)}j Nj each having Q scalar entries; and ii) |Nj| Lagrange multiplier vectors {ξτ,κ j ,j(ρ)}j Nj, and |Nj| kernel coefficient vectors {αj τ,κ(ρ+1)}j Nj of length B. Thus, the number of scalars unit j has to receive during ADMM iteration ρ amounts to 3|Nj| Q + 2|Nj| B. Further, unit j has to transmit i) |Nj| Lagrange multiplier vectors {ζτ,κ j,j (ρ)}j Nj, |Nj| + 1 row factors {{Mj τ,κ,j ,:(ρ + 1)}j Nj, Mj τ,κ,j,:(ρ + 1)} each of length Q; and ii) |Nj| Lagrange multiplier vectors {ξτ,κ j,j (ρ)}j Nj, and one kernel coefficient vector {αj τ,κ(ρ + 1)}j Nj each of DISTRIBUTED KERNEL-DRIVEN DATA CLUSTERING length B. Thus, the total number of scalars to be transmitted during ADMM iteration ρ amounts to (2 |Nj| + 1)Q + (|Nj| + 1)B. During steps D2a-D2c unit j has to receive |Nj| Lagrange multiplier vectors { ζ τ,κ j ,j(ρ)}j Nj, 2 |Nj| row factors {{Nj τ,κ,j,:(ρ + 1)}j Nj, {Nj τ,κ,j ,:(ρ + 1)}}j Nj each having Q scalar entries. Thus, the number of scalars unit j has to receive during iteration ρ amounts to 3|Nj| Q. Further, unit j has to transmit |Nj| Lagrange multiplier vectors { ζ τ,κ j,j (ρ)}j Nj, |Nj| + 1 row factors {Nj τ,κ,j ,:(ρ + 1)}j Nj, Nj τ,κ,j,:(ρ + 1) each of length Q. Thus, the total number of scalars to be transmitted during ADMM iteration ρ amounts to (2 |Nj| + 1)Q. 
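A quick tally of these per-iteration message counts, for hypothetical values of |Nj|, Q and B, is sketched below.

```python
def dkc_comm_load(n_neighbors, Q, B):
    """Scalars exchanged by one unit per ADMM iteration (steps D1a-D1c and D2a-D2c)."""
    d1_rx = 3 * n_neighbors * Q + 2 * n_neighbors * B
    d1_tx = (2 * n_neighbors + 1) * Q + (n_neighbors + 1) * B
    d2_rx = 3 * n_neighbors * Q
    d2_tx = (2 * n_neighbors + 1) * Q
    return d1_rx, d1_tx, d2_rx, d2_tx

print(dkc_comm_load(n_neighbors=5, Q=3, B=6))   # (105, 69, 45, 33)
```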
In summary, the communication cost is proportional to the neighborhood size |Nj|, the number of different classes/sources Q and the size of the kernel dictionary B which in practice are much smaller compared to the number of sensing agents P. Remark on finding set SQ: The clustering problem does not have to be solved to determine which such nodes belong in SQ and have access to neighborhood measurements that cover all Q sources present. Specifically, each unit Sj can solve a local version of (4) [considering only its neighborhood observations and corresponding kernel similarity matrices]. This can be done once during an initialization phase. If the neighborhood Nj is large enough such that the associated measurements contain information about all Q sources, then when solving a local version of (4) at the candidate unit Sj the optimal variable w in (4) will be strictly positive (details in Apdx. A), whereas if the neighborhood measurements were covering less than Q sources then w will be zero (rank-reduced M). Thus, each unit Sj can identify whether they belong to a set SQ and adjust their corresponding constraints in steps D1a and D2a accordingly. The same idea can be applied when setting up the network of sensing units during an initialization phase to ensure the presence of such a node, i.e., nonempty set SQ. During initialization, one such node Sj with longer receiving range capabilities can start gradually increasing reception range, thus incrementally increasing its neighborhood radius, and therefore the set of neighborhood measurements xj for j Nj that it can access. Unit Sj can then solve a local version of (4) with each current neighborhood size and if w = 0 then it keeps increasing the neighborhood size (or communication range) , until w > 0 in which case Sj knows its neighborhood measurements cover all Q sources present. Distributed Convergence: Given that the separable formulations in (18) and (19) are convex, we prove in Apdx. E that the updates {Mτ,κ Nj (ρ + 1), αj τ,κ(ρ + 1)} and Nτ,κ Nj (ρ + 1) obtained via steps D1a-D1c and D2a-D2c to cluster the data and determine the kernel similarity matrix, converge, i.e., Proposition 5 Under the assumptions of Prop. 4 and for fixed iteration indices τ, κ the ADMM iterates from (18) and (19) {Mτ,κ Nj (ρ + 1), Nτ,κ Nj (ρ + 1), αj τ,κ(ρ + 1)} converge as ρ , i.e., lim ρ Mj τ,κ,j,:(ρ + 1) = Mτ,κ+1,j,:, (39) lim ρ Nj τ,κ,j,:(ρ + 1) = Nτ,κ+1,j,:, for j Nj {j} (40) lim ρ αj τ,κ(ρ + 1) = ατ,κ+1, for j = 1, . . . , P, implying that all local vectors {Mj τ,κ,j,:(ρ + 1), Nj τ,κ,j,:(ρ + 1)}j Nj {j} con- verge to Mτ,κ+1,j,: and Nτ,κ+1,j,: respectively, while all local coefficient vectors {αj τ,κ(ρ + 1)}P j=1 converge to the same limit ατ,κ+1; thus satisfying the equality constraints in (18) and (19). Further, j SQ wτ,κ j (ρ)+ω j [ψτ,κ j (ρ) + ξθτ,κ j (ρ)] (41) Mℓ τ,κ,ℓ,:(ρ) 1 Mℓ τ,κ,ℓ,: Mℓ τ,κ,ℓ,: 2 (Mℓ τ,κ,ℓ,:(ρ))T ## j SQ wτ,κ j (ρ)+ω j [ψτ,κ j (ρ) + ξθτ,κ j (ρ)] (42) Nℓ τ,κ,ℓ,:(ρ) 1 Nℓ τ,κ,ℓ,: Nℓ τ,κ,ℓ,: 2 (Nℓ τ,κ,ℓ,:(ρ))T ## where f τ,κ and g τ,κ correspond to the minimum values of the costs in (18) and (19) respectively. The result of Prop. (5) combined with Prop. 4 implies that for ρ the factor and coefficient iterates {Mj τ,κ,j,:(ρ + 1), Nj τ,κ,j,:(ρ + 1), αj τ,κ(ρ + 1)}P j=1 will converge to limits Mτ,κ+1,j,:, Nτ,κ+1,j,: and ατ,κ+1 such that P b ατ,κ+1,b Ab x is block diagonal with rank Q as τ, κ . 
Thus, despite the fact that the converging limits of the centralized and distributed algorithms may not be the same, they share the block diagonal property crucial for facilitating the clustering task. In practice a sufficiently large finite number of ADMM iterations $\rho_t < \infty$ suffices to approach convergence. This is demonstrated using the numerical examples following next.

Computational Complexity: The computational complexity of our novel framework is compared with the complexity of the tensor multiple kernel graph-based clustering (TMKGC) scheme in (Ren et al., 2020), the structure-preserving multiple kernel clustering approach (SPMKC) in (Ren and Sun, 2020) and the K-means algorithm. The computational complexity per iteration in TMKGC (Ren et al., 2020) is $O(BP^2 \log_2(P) + B^2P^2)$, where B is the number of elements in the kernel dictionary, P the number of data vectors (or sensing units) and Q the number of underlying sources. The computational complexity per iteration in SPMKC (Ren and Sun, 2020) is $O(P^3)$. The computational complexity of K-means is $O(P^2)$. The computational complexity of the novel distributed kernel clustering (DKC) scheme per node per iteration mainly accounts for steps D1a-D1c and D2a-D2c. These can be divided into two tasks at unit j: i) updating the $|N_j|$ multipliers with a cumulative complexity of $O(|N_j|(B + Q))$; and ii) solving the local SDP problems involved in steps D1a and D2a with a complexity of $O((B + Q)^4)$ and $O(Q^4)$, respectively [see e.g., (Vandenberghe et al., 2005)]. It is practical to assume $B \ll P$ and $Q \ll P$, since the size of the kernel dictionary B and the number of underlying sources Q are smaller than the number of measurement vectors / sensing units P.

Comparing the computational complexity of DKC and TMKGC, we study under what conditions the following inequality

$$|N_j|(B+Q) + (B+Q)^4 + Q^4 < B^2P^2 \;\Longleftrightarrow\; \frac{|N_j|(B+Q)}{B^2} + \frac{(B+Q)^4}{B^2} + \frac{Q^4}{B^2} < P^2, \qquad (43)$$

is satisfied. Taking into account that in practice $P \gg B, Q$, and $Q < B$ since the number of elements in the dictionary can be chosen to be larger than the number of sources Q, then (43) can be satisfied for large P and $B, Q \ll P$. Comparing the computational complexity of DKC and SPMKC, we study under what conditions the following inequality

$$|N_j|(B+Q) + (B+Q)^4 + Q^4 < P^3 \;\Longleftrightarrow\; \frac{|N_j|(B+Q)}{P^3} + \left(\frac{B+Q}{P}\right)^3(B+Q) + \frac{Q^4}{P^3} < 1, \qquad (45)$$

is satisfied. The second inequality in (45) can easily be satisfied when $B, Q \ll P$. Thus, the computational complexity of our approach per unit per iteration is definitely lower when the amount of data vectors P is considerably larger than the cardinality of the dictionary B and the number of sources Q. When studying under what conditions our approach has lower computational complexity per iteration per unit than K-means, i.e.,

$$|N_j|(B+Q) + (B+Q)^4 + Q^4 < P^2 \;\Longleftrightarrow\; \frac{|N_j|(B+Q)}{P^2} + \left(\frac{B+Q}{P}\right)^2(B+Q)^2 + \frac{Q^4}{P^2} < 1, \qquad (46)$$

then the computational complexity of our approach will be lower than that of K-means when the amount of data vectors P is considerably larger than the cardinality of the dictionary B and the number of sources Q.

5. Numerical Simulations

Numerical tests are utilized next to study the effectiveness of the proposed algorithmic framework on multiple datasets.
We compare the performance of the novel approach with competing alternatives using three different datasets: (1) the Unimib dataset (Micucci et al., 2017), a collection of smartphone-based human activity detection readings with $Q = 3$ different classes; (2) the Salinas dataset (Sal, 2021), which consists of 3-dimensional hyperspectral images with $Q = 4$; and (3) a synthetic dataset with $Q = 4$. The performance of the novel clustering and learning scheme proposed in this paper (abbreviated CKC for the centralized version and DKC for the distributed version) is compared with: (1) the tensor multiple kernel graph-based clustering (TMKGC) scheme in (Ren et al., 2020); (2) the structure-preserving multiple kernel clustering approach (SPMKC) in (Ren and Sun, 2020); and (3) the K-means clustering algorithm using the Euclidean distance. Note that the distributed K-means approaches in (Chen et al., 2016; Oliva et al., 2013; Qin et al., 2016; Tsapanos et al., 2015) perform at best similarly to centralized K-means, which is why centralized K-means is used for comparison. Further, the only other distributed kernel-learning clustering approach available is the one in (Ren et al., 2020), against which we have performed extensive comparisons. It has to be emphasized, though, that this comparison is unfavorable to our approach, in the sense that TMKGC requires a central processing unit acting as a fusion center, whereas our framework does not require a fusion center and can operate in ad hoc network architectures.

Three figures of merit are employed to compare the clustering performance of CKC/DKC with the aforementioned alternatives: accuracy, NMI, and purity. Accuracy quantifies the success percentage in clustering the data into the correct groups:
$$\mathrm{Accuracy} = P^{-1}\sum_{j=1}^{P}\delta\big(C_j - \mathrm{map}(\hat{C}_j)\big), \quad (47)$$
where $\hat{C}_j$ is the cluster label returned by the clustering algorithms considered, $C_j$ is the ground-truth label, $\delta(\cdot)$ denotes the Kronecker delta function and $\mathrm{map}(\hat{C}_j)$ is the mapping of cluster label $\hat{C}_j$ to a class label using the Hungarian assignment algorithm (Kuhn, 1955). NMI quantifies the quality of the clusters,
$$\mathrm{NMI}(\Psi, \bar{\Psi}) = \frac{I(\Psi; \bar{\Psi})}{\sqrt{H(\Psi)\,H(\bar{\Psi})}},$$
where $I$ and $H$ refer to the mutual information and entropy, respectively, evaluated as
$$I(\Psi;\bar{\Psi}) = \sum_{q=1}^{Q}\sum_{p=1}^{Q}\frac{|\Psi_q \cap \bar{\Psi}_p|}{P}\log\frac{P\,|\Psi_q \cap \bar{\Psi}_p|}{|\Psi_q|\,|\bar{\Psi}_p|}, \qquad H(\Psi) = -\sum_{q=1}^{Q}\frac{|\Psi_q|}{P}\log\frac{|\Psi_q|}{P},$$
where $\Psi = \{\Psi_1, \Psi_2, \ldots, \Psi_Q\}$ corresponds to the $Q$ clusters found by the clustering scheme, while $\bar{\Psi} = \{\bar{\Psi}_1, \bar{\Psi}_2, \ldots, \bar{\Psi}_Q\}$ denotes the ground-truth cluster sets containing the correctly assigned data entries. NMI pertains more to a setting where the data labels are given (not the case in our setting) but is provided for a complete performance analysis. Purity quantifies the degree to which the found clusters contain elements from a single class. Purity is calculated as
$$\mathrm{Purity} = P^{-1}\sum_{q=1}^{Q}\max_{j=1,\ldots,Q}|\Psi_q \cap \bar{\Psi}_j|,$$
i.e., each found cluster is matched with the class set with which it has the largest overlap, and the number of correctly assigned elements $\max_{j=1,\ldots,Q}|\Psi_q \cap \bar{\Psi}_j|$ is used.

The performance of DKC will be tested in an ad hoc connected network of sensing units comprising $P = 60$ units placed randomly in the area $[0,1] \times [0,1]$. Two sensing units are connected if their Euclidean distance is less than 0.25. For the synthetic data and the Salinas dataset, the set of sensors $\mathcal{S}_Q$ whose single-hop neighbors sense all $Q$ clusters has 7 units, whereas in the Unimib dataset it has 8.
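A minimal Python sketch of the three figures of merit defined above is given next; it uses scipy's Hungarian assignment (linear_sum_assignment) for the label mapping in (47) and assumes integer labels in $\{0, \ldots, Q-1\}$. Nothing here is specific to CKC/DKC; it is a generic evaluation helper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_labels(true_labels, cluster_labels, Q):
    """Hungarian assignment (Kuhn, 1955): map cluster labels to class labels."""
    cost = np.zeros((Q, Q))
    for q in range(Q):
        for c in range(Q):
            cost[q, c] = -np.sum((cluster_labels == c) & (true_labels == q))
    rows, cols = linear_sum_assignment(cost)        # maximizes total overlap
    mapping = {c: q for q, c in zip(rows, cols)}
    return np.array([mapping[c] for c in cluster_labels])

def accuracy(true_labels, cluster_labels, Q):
    """Fraction of correctly clustered samples, eq. (47)."""
    return np.mean(map_labels(true_labels, cluster_labels, Q) == true_labels)

def purity(true_labels, cluster_labels, Q):
    """Each found cluster is matched with its best-overlapping class."""
    total = 0
    for c in range(Q):
        members = true_labels[cluster_labels == c]
        if members.size:
            total += np.max(np.bincount(members, minlength=Q))
    return total / true_labels.size

def nmi(true_labels, cluster_labels):
    # sklearn's default normalization (arithmetic mean of entropies) may differ
    # slightly from the geometric-mean convention written above.
    from sklearn.metrics import normalized_mutual_info_score
    return normalized_mutual_info_score(true_labels, cluster_labels)
```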
5.1 Synthetic Data

The synthetic dataset comprises $Q = 4$ different classes modeled through a kernel dictionary with six $60 \times 60$ kernel elements $\{A^b_x\}_{b=1}^{6}$. Kernel matrix $A^1_x$ is a $60 \times 60$ random matrix. Kernel matrix $A^2_x$ consists of two diagonal blocks occupying the first 15 rows and columns, and the rows and columns with indices 29 to 46, respectively. Dictionary kernel $A^3_x$ contains one diagonal block occupying rows and columns with indices 16 to 28, while the remaining entries are set to zero. Dictionary kernel $A^4_x$ contains one diagonal block occupying rows and columns with indices 47 to 60, while the remaining entries are set to zero. Additionally, $A^5_x$ is set to an all-ones matrix normalized to have trace one, while $A^6_x$ is set equal to the identity matrix normalized to have trace one. A construction sketch of this dictionary is given at the end of this subsection.

Fig. 3 depicts the trajectory of the six kernel coefficients $\{\alpha_b\}_{b=1}^{6}$ versus the number of DCA iterations. Each DCA iteration entails $\rho_t = 5$ ADMM iterations when utilizing Alg. 2 to tackle the tasks in lines 6 and 14 of Alg. 1 (DKC). All kernel coefficients are initialized at the same value $\{\alpha_{0,0,b} = 1/6\}_{b=1}^{6}$. Throughout Sec. 5, the kernel factors $M^j_{j,:}$ and $N^j_{j,:}$ for $j \in \mathcal{S}_Q$ are initialized by applying CKC locally at each sensing unit $j \in \mathcal{S}_Q$, while the rest of the sensing units $j \notin \mathcal{S}_Q$ initially set the entries of $M^j_{j,:}$ and $N^j_{j,:}$ to zero. As the DKC algorithm progresses through the iterations, it can be seen that only the coefficients weighing the block diagonal kernels in the dictionary converge to a nonzero value ($\alpha_2$, $\alpha_3$, $\alpha_4$), while the rest ($\alpha_1$, $\alpha_5$, $\alpha_6$) converge to zero since they correspond to kernel matrices that do not promote a block diagonal structure in the synthetic dictionary.

Figs. 4 and 5 depict the accuracy, purity and NMI of DKC versus the DCA iteration index for various numbers of ADMM iterations $\rho_t$. The parameters for DKC were set to $v = 0.01$, $\mu = 10$, $\omega = 2$, $\xi = 15$ and $c = 0.01$. All methods in this synthetic example are able to reach accuracy, NMI and purity equal to one, though the novel DKC approach does not require the presence of a central processor. Note that as the number of ADMM iterations increases the rate of convergence improves, especially when increasing from $\rho_t = 1$ to $\rho_t = 3$, beyond which the rate gains are negligible. Fig. 3 (Right) depicts the cumulative disagreement between the local kernel coefficients across all 60 units, i.e., $\sum_{j,j'}\delta^3_{j,j'}$ in (36). It can be seen that the disagreement converges to zero, demonstrating that all units estimate the same kernel coefficients and therefore the same kernel similarity matrix.

Figure 3: Trajectory of the kernel coefficients $\{\alpha_b\}_{b=1}^{6}$ vs. DCA iteration index.

Figure 4: (Left) Accuracy vs. DCA iteration index for different numbers of ADMM iterations; (Right) NMI vs. DCA iteration index.

Figure 5: (Left) Purity vs. DCA iteration index for different numbers of ADMM iterations; (Right) Cumulative spatial disagreement between the local kernel coefficients across all units vs. DCA iteration index.
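The following numpy sketch reproduces the synthetic kernel dictionary described at the beginning of this subsection. The text only specifies where the diagonal blocks sit, so the all-ones block contents and the symmetrization of the random kernel $A^1_x$ are assumptions made here; all kernels are normalized to unit trace, consistent with the trace-one convention used in Apdx. B.

```python
import numpy as np

P = 60
rng = np.random.default_rng(0)

def unit_trace(A):
    return A / np.trace(A)

def block(idx):
    """Diagonal block (all-ones here, an assumption) on the index set idx."""
    A = np.zeros((P, P))
    A[np.ix_(idx, idx)] = 1.0
    return A

R = rng.standard_normal((P, P))
A1 = unit_trace(R @ R.T)                                   # random kernel (symmetrized)
A2 = unit_trace(block(range(0, 15)) + block(range(28, 46)))  # blocks on rows/cols 1-15, 29-46
A3 = unit_trace(block(range(15, 28)))                      # block on rows/cols 16-28
A4 = unit_trace(block(range(46, 60)))                      # block on rows/cols 47-60
A5 = unit_trace(np.ones((P, P)))                           # all-ones, unit trace
A6 = unit_trace(np.eye(P))                                 # identity, unit trace

dictionary = [A1, A2, A3, A4, A5, A6]
```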
5.2 Unimib Dataset

The second dataset corresponds to the University of Milano Bicocca Smartphone-based Human Activity Recognition (Unimib) dataset, which consists of accelerometer readings of users that participated in activities such as walking, running, and climbing stairs ($Q = 3$). The signals are pre-processed such that the data vectors are grouped into individual epochs, each 51 samples long and centered around the peak of the epoch. Since the accelerometer readings are considered along all three axes, the concatenated signal is 153 samples long. In the distributed setting, unit $j$ acquires data entry $x_t(j)$; 20 sensing units acquire data entries $x_t(j)$ corresponding to walking, 20 units to running, and 20 units to climbing stairs. The objective is to cluster the signals based on the activity class they belong to.

Figs. 6-8 depict box plots of the accuracy, NMI and purity, respectively, of DKC, TMKGC, SPMKC and K-means averaged over 30 independent trials of the Unimib dataset. The parameters utilized for DKC are $v = 0.01$, $\mu = 10$, $\omega = 2$, $\xi = 15$ and $c = 0.01$; further, the kernel dictionary $D$ consists of 17 candidate kernel matrices, among which 10 are Gaussian with variances in the interval $[10^{-4}, 10^{8}]$ and 7 are polynomial with degrees in $\{1, \ldots, 7\}$. For SPMKC, parameter values that gave good performance were $\lambda_1 = 4$, $\lambda_2 = 1$, $\lambda_3 = 400$ and $\lambda_4 = 1$, while for TMKGC $\alpha = 0.1$ and $\beta = 10^{-3}$. The red mark inside these box plots indicates the median for each method, and the edges of the box mark the 25th and 75th percentiles across the trials utilized.

Figs. 6 and 8 show that both CKC and DKC outperform TMKGC, SPMKC as well as K-Means in terms of accuracy and purity across the 30 independent trials of Unimib data. CKC is formulated to learn a block diagonal similarity structure of a certain rank, in contrast to TMKGC. Interestingly, DKC also outperforms TMKGC, showing the effectiveness of the collaboration between single-hop neighboring units in carrying out the clustering task. For the NMI in Fig. 7, TMKGC attains the largest value and our novel CKC/DKC framework attains the second highest, outperforming SPMKC and K-Means. One reason is the utilization of tensors in TMKGC, which enables higher-order graph learning and better quality clusters, though at the cost of much higher computational complexity.

Figure 6: Accuracy box plot over 30 independent trials on the Unimib dataset.

Figure 7: NMI box plot over 30 independent trials on the Unimib dataset.

Figure 8: Purity box plot over 30 independent trials on the Unimib dataset.

Figs. 9-10 depict the accuracy, NMI, purity and cumulative disagreement of DKC versus the block coordinate descent iteration index $\tau$ for different numbers of DCA iterations $K$ and ADMM iterations $\rho_t$. For comparison, the accuracy, NMI and purity of the centralized TMKGC, SPMKC and K-means methods are also provided. The results are depicted for the parameter values selected earlier. One dataset among the 30 independent ones was utilized here to show the convergence properties of DKC. Fig. 9 shows that both the accuracy and the NMI, as the number of block iterations $\tau$ increases, converge to the corresponding values achieved by CKC for a sufficiently large number of ADMM iterations $\rho_t$ and DCA iterations $K$. CKC and TMKGC have close accuracies, though DKC running with limited $K$ and $\rho_t$ falls short of the centralized performance. NMI is still better for TMKGC on this dataset. In fact, DKC outperforms the clustering performance metrics achieved by the centralized approaches SPMKC and K-Means.
This highlights the collaborative nature of the novel approach, which allows information to diffuse across the agents as the ADMM iterations increase.

Figure 9: (Left) Accuracy vs. block coordinate iteration index for different numbers of ADMM and DCA iterations; (Right) NMI vs. iteration index for different numbers of ADMM and DCA iterations on the Unimib dataset.

Similar conclusions can be drawn for the purity metric in Fig. 10 (Left). Fig. 10 (Right) depicts the cumulative disagreement between the local kernel coefficients versus the block coordinate descent iteration index $\tau$; similarly to the synthetic scenario, as the ADMM iterations $\rho_t$ and DCA iterations $K$ increase, the disagreement gets closer to zero.

Figure 10: (Left) Purity vs. block coordinate iteration index for different numbers of ADMM and DCA iterations; (Right) Cumulative spatial disagreement between the local kernel coefficients across all units vs. block coordinate iteration index for different numbers of ADMM and DCA iterations on the Unimib dataset.

5.3 Salinas Dataset

This is a hyperspectral image dataset captured by an AVIRIS sensing system over the Salinas Valley, California. These are primarily farmland images that indicate the presence of different crops/materials in different parts of the images. Each sensing unit measurement vector $x_i$ has 224 entries; each of these vectors will be clustered into a group based on the crop/material it belongs to. A total of $Q = 4$ different randomly selected materials were considered and 25 random pixels were chosen. In the distributed setting, unit $j$ acquires data entry $x_t(j)$ for $j = 1, \ldots, 60$; 15 sensing units acquire data entries $x_t(j)$ corresponding to the first material, 13 units to the second material, 18 units to the third material and 14 units to the fourth material. The objective is to cluster the signals based on the crop/material class they belong to. The parameters utilized for DKC are $v = 0.01$, $\mu = 10$, $\omega = 2$, $\xi = 15$ and $c = 1$. Further, the kernel dictionary $D$ consists of 17 candidate kernel matrices, among which 10 are Gaussian with variances in the interval $[10^{-4}, 10^{8}]$ and 7 are polynomial with degrees in $\{1, \ldots, 7\}$. For SPMKC, values that gave good performance were $\lambda_1 = 6$, $\lambda_2 = 1$, $\lambda_3 = 600$ and $\lambda_4 = 1$, while for TMKGC $\alpha = 0.1$ and $\beta = 10^{-4}$.

Figs. 11-13 depict box plots of the accuracy, purity and NMI, respectively, of DKC, TMKGC, SPMKC and K-means averaged over 50 independent trials of the Salinas dataset. The red mark inside these box plots indicates the median for each method, and the edges of the box mark the 25th and 75th percentiles across the trials utilized.

Figure 11: Accuracy box plot over 50 independent trials on the Salinas dataset.

Figs. 11 and 13 show that CKC outperforms TMKGC, SPMKC as well as K-Means in terms of accuracy and purity, while the DKC accuracy gets very close to the centralized one achieved by CKC.
For the NMI, which is given here for completeness since labels are not available, TMKGC attains the largest value, though CKC and DKC come very close and outperform SPMKC and K-Means. The NMI is again better for TMKGC due to its tensor-based formulation, which extracts higher-order graph similarities.

Figure 12: NMI box plot over 50 independent trials on the Salinas dataset.

Figure 13: Purity box plot over 50 independent trials on the Salinas dataset.

Figs. 14-15 depict the accuracy, NMI, purity and cumulative disagreement of DKC versus the block coordinate descent iteration index $\tau$ for different numbers of DCA iterations $K$ and ADMM iterations $\rho_t$. For comparison, the accuracy, NMI and purity of the centralized TMKGC, SPMKC and K-means methods are also provided. The results are depicted for the parameter values selected earlier. One dataset among the 50 independent Salinas datasets was utilized here to show the convergence properties of DKC. Fig. 14 shows that both the accuracy and the NMI, as the number of block iterations $\tau$ increases, closely approach the corresponding values achieved by CKC as the number of ADMM iterations $\rho_t$ and DCA iterations $K$ increases. In fact, DKC outperforms the clustering performance metrics achieved by the centralized approaches TMKGC, SPMKC and K-Means for a sufficiently large number of ADMM and DCA iterations. The results here are depicted for one dataset among the 50 realizations, for which CKC outperforms TMKGC. Again, the objective is to demonstrate the potential of DKC to get close to the CKC performance and potentially outperform alternative centralized approaches.

Figure 14: (Left) Accuracy vs. block coordinate iteration index for different numbers of ADMM and DCA iterations; (Right) NMI vs. iteration index for different numbers of ADMM and DCA iterations on the Salinas dataset.

Similar conclusions can be drawn for the purity metric in Fig. 15 (Left). Fig. 15 (Right) depicts the cumulative spatial disagreement between the local kernel coefficients versus the block coordinate descent iteration index $\tau$; similarly to the synthetic scenario, as the ADMM iterations $\rho_t$ and DCA iterations $K$ increase, the disagreement gets closer to zero. Again, Figs. 14-15 highlight the collaborative nature of the novel approach, which allows information to diffuse across the network as the ADMM iterations increase.

Figure 15: (Top) Purity vs. block coordinate iteration index for different numbers of ADMM and DCA iterations; (Bottom) Cumulative disagreement between the local kernel coefficients across all sensing units vs. block coordinate iteration index for different numbers of ADMM and DCA iterations on the Salinas dataset.

6. Conclusions

A novel distributed framework for joint kernel learning and clustering was derived, capable of determining clustering configurations in an unsupervised manner. Utilizing principles from semidefinite programming, block coordinate descent and difference-of-convex formulations, we arrive at a minimization formulation that facilitates the selection of proper block diagonal kernel similarity matrices enabling effective unsupervised data clustering. The SDP problems solved during a block coordinate cycle are further reformulated in a separable fashion, allowing the application of ADMM, which leads to a fully distributed joint kernel learning and clustering approach.
Convergence guarantees demonstrate that the novel framework promotes the construction of block diagonal data similarity matrices. Detailed numerical examples show the superior clustering accuracy achieved by the distributed framework over existing alternatives. Last but not least, it should be emphasized that multi-hop communication typically comes with severe challenges in message reception, scheduling and routing, to name a few. In this work we focused on the learning aspects of the clustering problem. Moreover, kernel dictionary selection and an unknown number of sources are practical aspects that should also be considered. Future work will focus on addressing the aforementioned difficulties to generalize the capabilities of our novel clustering framework.

Acknowledgments

The work in this paper was supported by ARO grant W911NF-21-1-0231.

Appendix A. Proof of Proposition 1

Assume that the dictionary of kernels $D := \{A^b_x\}_{b=1}^{B}$ contains elements for which there exist nonnegative scalars $\beta_1, \ldots, \beta_B$ with $\sum_{b=1}^{B}\beta_b = 1$ such that $\sum_{b=1}^{B}\beta_b A^b_x$ is block diagonal with $Q$ diagonal blocks and equal rank. We show that the formulation in (4) is able to find a set of optimal coefficients $\alpha^*_b$ such that $\sum_{b=1}^{B}\alpha^*_b A^b_x$ is also block diagonal with $Q$ diagonal blocks, each of rank one. Employing the Schur complement (Boyd and Vandenberghe, 2004), the first matrix inequality in (4) is rewritten as
$$\begin{bmatrix} I_Q & M^T \\ M & \sum_{b=1}^{B}\alpha_b A^b_x \end{bmatrix} \succeq 0 \;\Leftrightarrow\; \sum_{b=1}^{B}\alpha_b A^b_x \succeq M M^T. \quad (49)$$
The fourth matrix inequality $(N^T M + M^T N) \succeq w I_Q$, $w > 0$, in (4) implies that the rank of $M$ and $N$ should be equal to $Q$. To prove this, consider the singular value decomposition $M^T N = U_{mn}\Sigma_{mn}V_{mn}^T$, where $U_{mn}$ and $V_{mn}$ are $Q \times Q$ orthonormal matrices, while $\Sigma_{mn}$ is a $Q \times Q$ diagonal matrix. Let us assume that $\mathrm{rank}(M^T N) = Q - Z < Q$, and let $V_Z \in \mathbb{R}^{Q \times Z}$ correspond to those columns of $V_{mn}$ that span the nullspace of $M^T N$; note that the dimensionality of $\mathrm{nullspace}(M^T N)$ is equal to $Z > 0$. Given that $V_Z^T V_Z = I_Z$, we multiply the matrix inequality in (4) from the left with $V_Z^T$ and from the right with $V_Z$ to obtain
$$V_Z^T(U_{mn}\Sigma_{mn}V_{mn}^T + V_{mn}\Sigma_{mn}U_{mn}^T)V_Z \succeq w I_Z \;\Leftrightarrow\; V_Z^T U_{mn}\Sigma_{mn}V_{mn}^T V_Z + V_Z^T V_{mn}\Sigma_{mn}U_{mn}^T V_Z \succeq w I_Z. \quad (50)$$
Assuming that $M^T N$ is rank deficient, its nullspace is nonempty with dimensionality $Z \geq 1$. Then, the left-hand side of the second inequality in (50) will be zero, i.e., $0_{Z\times Z} \succeq w I_Z$, which cannot be true since $w > 0$. Therefore the matrix $M^T N$ has full rank equal to $Q$, and hence both factors $M$ and $N$ have full rank equal to $Q$ when $w > 0$ [for a sufficiently large coefficient $\mu$ in (4)]. Thus, from (49), $\sum_{b=1}^{B}\alpha^*_b A^b_x$ will have rank at least $Q$ for $w > 0$. The second matrix inequality in (4) gives
$$\begin{bmatrix} \tfrac{\theta}{Q} I_P & M - N \\ (M-N)^T & \tfrac{1}{Q} I_Q \end{bmatrix} \succeq 0 \;\Rightarrow\; \|M - N\|_F^2 \leq \theta, \quad (51)$$
which for sufficiently large $\xi$ ensures that $M$ and $N$ are approximately equal.
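For reference (this note is not part of the original proof), the Schur complement condition invoked in (49), (51) and in the third LMI of (4) below is the standard one:

```latex
% Schur complement condition for positive semidefiniteness, assuming A \succ 0:
\begin{bmatrix} A & B \\ B^{T} & C \end{bmatrix} \succeq 0
\quad \Longleftrightarrow \quad
C - B^{T} A^{-1} B \succeq 0 .
```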
From the Schur complement, the third matrix inequality in (4) gives
$$\begin{bmatrix} \tfrac{\psi}{P} I_P & \sum_{b=1}^{B}\alpha_b A^b_x - M N^T \\ \big(\sum_{b=1}^{B}\alpha_b A^b_x - M N^T\big)^T & \tfrac{1}{P} I_P \end{bmatrix} \succeq 0 \;\Rightarrow\; \Big\|\sum_{b=1}^{B}\alpha_b A^b_x - M N^T\Big\|_F^2 \leq \psi.$$
The optimal $\psi^* = 0$, since there exists a convex combination $\sum_{b=1}^{B}\beta_b A^b_x$ that has rank $Q$. Indeed, $\psi^* = 0$ can be achieved by choosing optimal $\alpha^*_b$ that result in a rank-$Q$ kernel $\sum_{b=1}^{B}\alpha^*_b A^b_x$, while there exist optimal factors $M^*, N^*$ such that $\sum_{b=1}^{B}\alpha^*_b A^b_x = M^*(N^*)^T$, which gives $\psi^* = 0$.

Finally, the term $v\sum_{\ell=1}^{P}[\|M_{\ell,:}\|_1 - \|M_{\ell,:}\|_2] \geq 0$ attains its lowest value of zero if the factor $M^*$ has at most one nonzero element in each row $M^*_{\ell,:}$ for $\ell = 1, \ldots, P$ (cf. Sec. 3). The same holds for the term $v\sum_{\ell=1}^{P}[\|N_{\ell,:}\|_1 - \|N_{\ell,:}\|_2] \geq 0$. Given that the $P \times Q$ factors $M^*, N^*$ have rank $Q$, there are $Q$ linearly independent columns in $M^*, N^*$. Thus, the nonzero elements in each column of $M^*$ should be in nonoverlapping positions, resulting in a block diagonal matrix $\sum_{b=1}^{B}\alpha^*_b A^b_x$ with $Q$ blocks whose positions correspond to the support of each of the $Q$ columns of $M^*$ and $N^*$. Every row $M^*_{\ell,:}$ (and $N^*_{\ell,:}$) should have exactly one nonzero entry; otherwise there is at least one diagonal block with rank greater than 1, resulting in $\mathrm{rank}(\sum_{b=1}^{B}\alpha^*_b A^b_x) > Q$, which is not true.

Appendix B. Proof of Lemma 2

We establish that the cost function, namely $J_{M,\tau,\kappa}(\cdot)$, in (6) is bounded below by a finite negative number for any $\tau$ and $\kappa$ [similar arguments can be applied for (4)]. Note that the variable $w_{\tau,\kappa+1} \geq 0$. For the variable $\psi$, after using the Schur complement in the third LMI in (6), we obtain
$$\Big\|\sum_{b=1}^{B}\alpha_{\tau,b} A^b_x - M_{\tau,\kappa+1}N_{\tau}^T\Big\|_F^2 \leq \psi_{\tau,\kappa+1}, \quad (53)$$
from which $\psi_{\tau,\kappa+1} \geq 0$. Similarly, $\theta_{\tau,\kappa+1} \geq 0$. From the constraints in (6),
$$w_{\tau,\kappa+1} \leq Q^{-1}\mathrm{trace}\big((M_{\tau,\kappa+1})^T N_\tau\big) \leq Q^{-1}\sum_{\ell=1}^{P}\|N_{\tau,\ell,:}\|_2\|M_{\tau,\kappa+1,\ell,:}\|_2 \leq Q^{-1}\sum_{\ell=1}^{P}\max\big(\|N_{\tau,\ell,:}\|_2^2, \|M_{\tau,\kappa+1,\ell,:}\|_2^2\big) \leq Q^{-1}\sum_{\ell=1}^{P}\big[\|N_{\tau,\ell,:}\|_2^2 + \|M_{\tau,\kappa+1,\ell,:}\|_2^2\big]. \quad (54)$$
The second inequality in (54) follows from the Cauchy-Schwarz inequality (Horn and Johnson, 2012). Now, from the first LMI in (6) it follows that $\mathrm{trace}\big(M_{\tau,\kappa+1}(M_{\tau,\kappa+1})^T\big) \leq \mathrm{trace}\big(\sum_b \alpha_{\tau+1,b} A^b_x\big) = 1$, since $\mathrm{trace}(A^b_x) = 1$ and $\sum_b \alpha_b = 1$. The second LMI in (6) gives
$$\|M_{\tau,\kappa+1} - N_\tau\|_F^2 \leq \theta_{\tau,\kappa+1}, \quad (55)$$
from which it follows that $N_\tau = M_{\tau,\kappa+1} + E_{\tau,\kappa+1}$ with $\|E_{\tau,\kappa+1}\|_F^2 \leq \theta_{\tau,\kappa+1}$. Then,
$$\|N_\tau\|_F^2 = \|M_{\tau,\kappa+1}\|_F^2 + \|E_{\tau,\kappa+1}\|_F^2 + 2\,\mathrm{trace}\big(M_{\tau,\kappa+1}^T E_{\tau,\kappa+1}\big) \leq 1 + \theta_{\tau,\kappa+1} + 2\sum_{\ell=1}^{P}\big[\|M_{\tau,\kappa+1,\ell,:}\|_2^2 + \|E_{\tau,\kappa+1,\ell,:}\|_2^2\big] \leq 3 + 3\theta_{\tau,\kappa+1}, \quad (56)$$
where the second and third inequalities in (56) result from (54), $\|E_{\tau,\kappa+1}\|_F^2 \leq \theta_{\tau,\kappa+1}$ and $\|M_{\tau,\kappa+1}\|_F^2 \leq 1$. Combining (54) and (56), it follows for any $\tau, \kappa$ that
$$w_{\tau,\kappa+1} \leq Q^{-1}[4 + 3\theta_{\tau,\kappa+1}]. \quad (57)$$
By increasing $\xi$, $\theta_{\tau,\kappa+1}$ can be made small and finite. Next we demonstrate that the terms $\|M_{\ell,:}\|_1 - \tfrac{M_{\tau,\kappa,\ell,:}}{\|M_{\tau,\kappa,\ell,:}\|_2}M_{\ell,:}^T$ are nonnegative. Specifically,
$$M_{\tau,\kappa,\ell,:}M_{\ell,:}^T \leq \|M_{\tau,\kappa,\ell,:}\|_2\|M_{\ell,:}\|_2 \leq \|M_{\tau,\kappa,\ell,:}\|_2\|M_{\ell,:}\|_1 \;\Rightarrow\; \|M_{\ell,:}\|_1 - \frac{M_{\tau,\kappa,\ell,:}}{\|M_{\tau,\kappa,\ell,:}\|_2}M_{\ell,:}^T \geq 0. \quad (58)$$
The first inequality in (58) is a direct application of the Cauchy-Schwarz inequality (Horn and Johnson, 2012), whereas the second inequality results from the $\ell_1$ norm of a vector being larger than or equal to its $\ell_2$ norm. The second part of (58) follows after dividing both sides of the first part by $\|M_{\tau,\kappa,\ell,:}\|_2$. Note that $\|M_{\ell,:}\|_1 - \tfrac{M_{\tau,\kappa,\ell,:}}{\|M_{\tau,\kappa,\ell,:}\|_2}M_{\ell,:}^T = 0$ if and only if $M_{\ell,:}$ has one nonzero element, in which case $\|M_{\ell,:}\|_2 = \|M_{\ell,:}\|_1$. Thus, we conclude that the cost function satisfies $J_{M,\tau,\kappa}(\cdot) \geq -\mu Q^{-1}[4 + 3\theta_{\tau,\kappa}]$ and is therefore bounded below by a finite negative value. The same process can be applied for the costs in (4) and (8). If the cost function $J_{M,\tau,\kappa+1}(\cdot)$ reaches a negative value, this implies that $w_{\tau,\kappa+1} > 0$, since otherwise the cost would be positive. This further implies from (6) that
$$0_{Q\times Q} \prec w_{\tau,\kappa+1} I_Q \preceq 0.5\,(N_\tau^T M_{\tau,\kappa+1} + M_{\tau,\kappa+1}^T N_\tau). \quad (59)$$
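For reference (this note is not part of the original proof), the two elementary inequalities invoked in (54) and (58) are, for any row vectors $a, b$ of the same length:

```latex
% Cauchy-Schwarz and the l1 >= l2 norm inequality used in (54) and (58):
a\,b^{T} \;\le\; \|a\|_{2}\,\|b\|_{2}
\qquad\text{and}\qquad
\|b\|_{2} \;\le\; \|b\|_{1},
\quad\text{hence}\quad
a\,b^{T} \;\le\; \|a\|_{2}\,\|b\|_{1}.
```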
Using the line of arguments between eqs. (49) and (51), it follows that (59) guarantees that $M_{\tau,\kappa+1}$ has full rank $Q$. Similar arguments can be applied for the costs in (4) and (8).

Appendix C. Proof of Proposition 3

Let $J_{\tau,0}(M, w, \psi, \theta, \alpha, N_\tau)$ denote the cost function in (4) when fixing the factor $N$ to $N_\tau$. Utilizing Thm. 3 in (Tao and An, 1997), it turns out that Steps 1a and 1b of Alg. 1, applied to minimize the convex formulation (6) for fixed $N = N_\tau$, result in a decreasing sequence of cost values, i.e., for $\kappa = 0, 1, \ldots$
$$J_{\tau,\kappa+1}(M_{\tau,\kappa+1}, \alpha_{\tau,\kappa+1}, w_{\tau,\kappa+1}, \psi_{\tau,\kappa+1}, \theta_{\tau,\kappa+1}, N_\tau) \leq J_{\tau,\kappa}(M_{\tau,\kappa}, \alpha_{\tau,\kappa}, w_{\tau,\kappa}, \psi_{\tau,\kappa}, \theta_{\tau,\kappa}, N_\tau). \quad (60)$$
Lemma 2 and (60) (monotone convergence theorem) imply
$$\lim_{\kappa\to\infty} J_{\tau,\kappa+1}(\cdot) = J_\tau^*, \quad (61)$$
where $J_\tau^*$ is finite. Thus, there exists a sufficiently large iteration index $K \geq 0$ such that $|J_{\tau,\kappa+1}(\cdot) - J_{\tau,\kappa}(\cdot)| \leq \epsilon$ for $\kappa \geq K$. Since the cost $J_{\tau,\kappa+1}(\cdot)$ is a continuous function, there exists such an integer $K$ such that $\|M_{\tau,\kappa+1} - M_{\tau,\kappa}\|_F \leq \epsilon_1$ and $\|\alpha_{\tau,\kappa+1} - \alpha_{\tau,\kappa}\|_F \leq \epsilon_1$ for all $\kappa \geq K$. This also follows from the convergence properties of DCA in (Tao and An, 1997). At iteration $\kappa = K$ the stopping criterion will be satisfied at the end of Step 1b. At this point the iterates will be $M_{\tau+1}$, $\alpha_{\tau+1}$, $N_\tau$, $w_{\tau,K}$, $\psi_{\tau,K}$ and $\theta_{\tau,K}$. The corresponding cost function value of (4) will be equal to
$$J_{\tau,K} = -\mu\, w_{\tau,K} + \omega\, \psi_{\tau,K} + \xi\, \theta_{\tau,K} + v\sum_{\ell=1}^{P}\big[\|M_{\tau+1,\ell,:}\|_1 - \|M_{\tau+1,\ell,:}\|_2\big] + v\sum_{\ell=1}^{P}\big[\|N_{\tau,\ell,:}\|_1 - \|N_{\tau,\ell,:}\|_2\big].$$
During iteration $\tau$ and before Steps 2a and 2b are carried out recursively, the initial cost value for (4) is equal to $J_{\tau,K}$, calculated using the most recent updates $M_{\tau+1}$, $\alpha_{\tau+1}$, $N_\tau$, $w_{\tau,K}$, $\psi_{\tau,K}$ and $\theta_{\tau,K}$. During Steps 2a and 2b, the factor $M = M_{\tau+1}$ and the kernel coefficients $\alpha = \alpha_{\tau+1}$ are kept fixed and the cost in (4) is minimized with respect to $N$, $w$, $\psi$ and $\theta$. Again using Thm. 3 in (Tao and An, 1997), it turns out that Steps 2a and 2b of Alg. 1, applied to minimize the convex formulation (4) for fixed $M$ and $\alpha$, result in a decreasing sequence of cost values, i.e., for $\kappa = 0, 1, \ldots$
$$J_{\tau,\kappa+1}(M_{\tau+1}, \alpha_{\tau+1}, w_{\tau,\kappa+1}, \psi_{\tau,\kappa+1}, \theta_{\tau,\kappa+1}, N_{\tau,\kappa+1}) \leq J_{\tau,\kappa}(M_{\tau+1}, \alpha_{\tau+1}, w_{\tau,\kappa}, \psi_{\tau,\kappa}, \theta_{\tau,\kappa}, N_{\tau,\kappa}), \quad (62)$$
while $J_{\tau,0}(M_{\tau+1}, \alpha_{\tau+1}) = J_{\tau,K}$ when Steps 2a and 2b start. Thus, there exists an integer $K'$ such that $\|N_{\tau,\kappa+1} - N_{\tau,\kappa}\|_F \leq \epsilon_1$ for all $\kappa \geq K'$. From (60) and (62), the cost function value iterates $J_\tau(M_\tau, N_\tau, \alpha_\tau)$ for (4) form a decreasing sequence; since the cost is bounded from below, the cost iterates converge to a finite value. The continuity of the cost in (4) implies that a sufficiently large $T$ exists such that $\|M_{\tau+1} - M_\tau\|_F + \|N_{\tau+1} - N_\tau\|_F + \|\alpha_{\tau+1} - \alpha_\tau\|_F < \epsilon_2$ for $\tau > T$.

Appendix D. Proof of Proposition 4

From the equality constraints in the last two lines of (18) and (19), notice that $M^j_{j',:} = M_{j',:}$, $N^j_{j',:} = N_{j',:}$ and $\alpha^j_b = \alpha_b$ for all $j' \in \mathcal{N}_j \cup \{j\}$, $j = 1, \ldots, P$ and $b = 1, \ldots, B$; the notation is simplified accordingly in what follows. Using similar arguments as in the last two paragraphs of Apdx. B, it follows that the set of local LMIs $[H^2_\tau(N_\tau)]_{\mathcal{N}_j} \succeq w_j I_Q$ for $j \in \mathcal{S}_Q$ with $w_j > 0$ in (18) ensures that a minimizing factor $M^*$ satisfies
$$0.5\sum_{j' \in \mathcal{N}_j \cup \{j\}}\big[(N^j_{\tau,j',:})^T M^*_{j',:} + (M^*_{j',:})^T N^j_{\tau,j',:}\big] \succeq w_j I_Q \succ 0,$$
which further implies that the $(|\mathcal{N}_j|+1) \times Q$ factor submatrix $M^*_{\mathcal{N}_j} := \big[(M^*_{j,:})^T, (M^*_{j_1,:})^T, \ldots, (M^*_{j_{|\mathcal{N}_j|},:})^T\big]^T$, with $j_1, \ldots, j_{|\mathcal{N}_j|} \in \mathcal{N}_j$, has rank equal to $Q$; thus $\mathrm{rank}(M^*) = Q$. Similarly, the third LMI in (19) guarantees that a minimizing $N^*$ will also have rank equal to $Q$.
The local LMI constraint $G_{1,j} \succeq 0$ can be rewritten as $\sum_{b=1}^{B}\alpha_b [A^b_x]_{\mathcal{N}_j} - M_{\mathcal{N}_j}(M_{\mathcal{N}_j})^T \succeq 0$, which for an optimal set of variables $M^*$ and $\{\alpha^*_b\}_{b=1}^{B}$ produces
$$\sum_{b=1}^{B}\alpha^*_b [A^b_x]_{\mathcal{N}_j} - M^*_{\mathcal{N}_j}(M^*_{\mathcal{N}_j})^T = E_{\mathcal{N}_j}\Big[\sum_{b=1}^{B}\alpha^*_b A^b_x - M^*(M^*)^T\Big]E_{\mathcal{N}_j}^T \succeq 0, \quad (63)$$
where $E_{\mathcal{N}_j}$ is a $(|\mathcal{N}_j|+1) \times P$ matrix in which each row has a single nonzero entry equal to one, and the column indices of those nonzero entries are in the set $\mathcal{N}_j \cup \{j\}$. Equation (63) indicates that $\sum_{b=1}^{B}\alpha^*_b [A^b_x]_{\mathcal{N}_j} - M^*_{\mathcal{N}_j}(M^*_{\mathcal{N}_j})^T$ is a submatrix of $\sum_{b=1}^{B}\alpha^*_b A^b_x - M^*(M^*)^T$. Given that $\sum_{b=1}^{B}\alpha^*_b [A^b_x]_{\mathcal{N}_j} - M^*_{\mathcal{N}_j}(M^*_{\mathcal{N}_j})^T \succeq 0$ and $\mathrm{rank}(M^*_{\mathcal{N}_j}(M^*_{\mathcal{N}_j})^T) = Q$, it follows that $\mathrm{rank}(\sum_{b=1}^{B}\alpha^*_b [A^b_x]_{\mathcal{N}_j}) \geq Q$, which ensures that $\mathrm{rank}(\sum_{b=1}^{B}\alpha^*_b A^b_x) \geq Q$ since $\sum_{b=1}^{B}\alpha^*_b [A^b_x]_{\mathcal{N}_j}$ is a submatrix of $\sum_{b=1}^{B}\alpha^*_b A^b_x$.

If the kernel dictionary $D$ contains a unique subset of kernels whose convex combination is block diagonal with $Q$ diagonal blocks of rank 1, then there exist kernel coefficients $\{\alpha^*_b\}_{b=1}^{B}$ and $P \times Q$ factors $M^*, N^*$ such that $\sum_{b=1}^{B}\alpha^*_b A^b_x = M^*(N^*)^T$, from which it follows that for sufficiently large $\omega$, $G_{\tau,3,j}(N^*) = 0$, which further implies that $\psi^*_j = 0$ for $j = 1, \ldots, P$ in (18). Similarly, for sufficiently large $\omega$ in (19), $G_{\tau,3,j}(M^*) = 0$, which further implies that $\psi^*_j = 0$ in (19). The block diagonal structure of $\sum_{b=1}^{B}\alpha^*_b A^b_x$ implies that $M^*$ and $N^*$ can have at most one nonzero entry per row. For sufficiently large $v$ and number of DCA iterations $K$, if $M_{\tau,\kappa} = \mathrm{diag}(c_1, c_2, \ldots, c_P)M^*$ for $\kappa \geq K$, with $\{c_i\}_{i=1}^{P}$ fixed arbitrary scalars, then
$$\|M^*_{\ell,:}\|_1 - \frac{M_{\tau,\kappa,\ell,:}}{\|M_{\tau,\kappa,\ell,:}\|_2}(M^*_{\ell,:})^T = \|M^*_{\ell,:}\|_1 - \frac{c_\ell M^*_{\ell,:}(M^*_{\ell,:})^T}{\|c_\ell M^*_{\ell,:}\|_2} = \|M^*_{\ell,:}\|_1 - \|M^*_{\ell,:}\|_2 = 0, \quad (64)$$
for $\ell = 1, \ldots, P$, where the first equality stems from $M_{\tau,\kappa,\ell,:} = c_\ell M^*_{\ell,:}$, whereas the last equality follows from the property that $M^*$ has at most one nonzero entry per row. Thus, for sufficiently large $v$ the fourth summation term in the cost in (18) will be zero. Similarly, it can be shown that the fourth summation term in (19) will also be zero. Further, $w^*_j$ in (18) will attain the maximum possible value, coinciding with the $Q$th largest eigenvalue of the local matrix $[H^2_\tau(N^*)]_{\mathcal{N}_j}$ formed using the submatrix $M^*_{\mathcal{N}_j}$. Thus, $M^*$ and $\{\alpha^*_b\}$ are minimizers of (18), and $N^*$ is a minimizer of (19), for sufficiently large $v$ and $\omega$. Similarly, for sufficiently large $\omega$ and $v$, $M^*$, $N^*$ and $\{\alpha^*_b\}_{b=1}^{B}$ result in $\psi^* = 0$, following from the third LMI in (6) and the second LMI in (8) being equal to a zero matrix, while
$$\|M^*_{\ell,:}\|_1 - \frac{M_{\tau,\kappa,\ell,:}}{\|M_{\tau,\kappa,\ell,:}\|_2}(M^*_{\ell,:})^T = \|N^*_{\ell,:}\|_1 - \frac{N_{\tau,\kappa,\ell,:}}{\|N_{\tau,\kappa,\ell,:}\|_2}(N^*_{\ell,:})^T = 0, \quad (65)$$
for $\ell = 1, \ldots, P$ and $\kappa \geq K$, when $M_{\tau,\kappa} = \mathrm{diag}(c_1, c_2, \ldots, c_P)M^*$ and $N_{\tau,\kappa} = \mathrm{diag}(c'_1, c'_2, \ldots, c'_P)N^*$, with $\{c_i, c'_i\}_{i=1}^{P}$ fixed arbitrary scalars. Again, $w^*$ will attain the maximum possible value, coinciding with the $Q$th largest eigenvalue of $H^2_{\tau,\kappa}(M^*)$. Thus, $M^*$, $N^*$ and $\{\alpha^*_b\}_{b=1}^{B}$ are also minimizers of (6)-(8) for sufficiently large $v$ and $\omega$. Hence, although the centralized and separable formulations in (6) and (18) may not be equivalent, they share optimal solutions under the assumptions of Prop. 4 [similarly for (8) and (19)].

Appendix E. Proof of Proposition 5

We will utilize the convergence claims in (He and Yuan, 2015). First we establish that the constraint set of (18) is strictly feasible (Slater's condition). Consider a set of nonnegative kernel coefficient values $\{\bar{\alpha}_b\}_{b=1}^{B}$ such that $\alpha^j_b = \beta^{j,j'}_b = \bar{\alpha}_b$ for $j' \in \mathcal{N}_j$ and $j = 1, \ldots, P$, which satisfy $\sum_{b=1}^{B}\bar{\alpha}_b = 1$ as well as the equality constraints involving $\alpha^j_b$ and $\beta^{j,j'}_b$ in (18). Let the eigenvalue decomposition be $\sum_{b=1}^{B}\bar{\alpha}_b A^b_x = U_{x,B}\Lambda_B U_{x,B}^T$ and consider the factor matrix $\bar{M} = U_{x,B}[:, 1\!:\!Q]\Lambda_{B,Q}^{1/2}$, where $U_{x,B}[:, 1\!:\!Q]$ contains the first $Q$ principal eigenvectors of $U_{x,B}$ and $\Lambda_{B,Q}$ contains the corresponding principal eigenvalues. Note that $\sum_{b=1}^{B}\bar{\alpha}_b A^b_x - \bar{M}\bar{M}^T = \sum_{b=1}^{B}\bar{\alpha}_b A^b_x - U_{x,B}[:, 1\!:\!Q]\Lambda_{B,Q}U_{x,B}[:, 1\!:\!Q]^T$ is a positive semidefinite matrix, thus $\sum_{b=1}^{B}\bar{\alpha}_b A^b_x \succeq \bar{M}\bar{M}^T$, which further implies that $G_{1,j} \succeq 0$ is satisfied for $j = 1, \ldots, P$ in (18).
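A small numpy sketch of the strictly feasible point construction above follows; it is illustrative only, and the kernel matrix used here is a random stand-in for $\sum_b \bar{\alpha}_b A^b_x$ rather than an actual dictionary combination.

```python
import numpy as np

P, Q = 60, 4
rng = np.random.default_rng(1)

# Stand-in for the convex combination sum_b alpha_b * A_x^b (any PSD matrix works).
R = rng.standard_normal((P, P))
K = R @ R.T / P

# Eigendecomposition K = U Lambda U^T with eigenvalues in decreasing order.
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Feasible factor: M_bar = U[:, :Q] Lambda_Q^{1/2}.
M_bar = eigvecs[:, :Q] @ np.diag(np.sqrt(eigvals[:Q]))

# K - M_bar M_bar^T retains only the trailing P - Q eigenvalues, hence it is PSD,
# so the LMI G_{1,j} >= 0 holds at this point.
residual = K - M_bar @ M_bar.T
print(np.min(np.linalg.eigvalsh(residual)) >= -1e-9)   # True (up to round-off)
```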
Further, setting $Z_{j,j'} = M^j_{j',:} = M^{j'}_{j',:} = \bar{M}_{j',:}$ for $j = 1, \ldots, P$ and $j' \in \mathcal{N}_j$ ensures that the last two equalities in (18) are satisfied. For the coefficient values $\{\bar{\alpha}_b\}_{b=1}^{B}$ and the factor $\bar{M}$, we can choose values $\bar{\psi}_j, \bar{\theta}_j$ that satisfy $G_{\tau,2,j}(N_\tau) \succeq 0$ and $G_{\tau,3,j}(N_\tau) \succeq 0$ for $j = 1, \ldots, P$. By setting $\bar{w}_j$ equal to the $Q$th largest eigenvalue of $\bar{M}_{\mathcal{N}_j}(\bar{M}_{\mathcal{N}_j})^T \geq 0$ and choosing a sufficiently small $\bar{\theta}_j$, the LMI $[H^2_\tau(N_\tau)]_{\mathcal{N}_j} \succeq \bar{w}_j I_Q$ for $j \in \mathcal{S}_Q$ is satisfied. Thus, the solution set of (18) is non-empty.

The $x$ variables in (He and Yuan, 2015, Eqn. (1.1)) correspond to the factor vectors $M^j_{j,:}$, $\{M^j_{j',:}\}_{j' \in \mathcal{N}_j}$, the coefficients $\{\alpha^j_b\}_{b=1}^{B}$, and the variables $w_j$, $\psi_j$ and $\theta_j$ for $j = 1, \ldots, P$. The $y$ variables in (He and Yuan, 2015, Eqn. (1.1)) include the auxiliary variables $Z_{j,j'}$ and $\beta^{j,j'}_b$ for $j = 1, \ldots, P$ and $j' \in \mathcal{N}_j$. The equality constraint $Ax + By = 0$ in (He and Yuan, 2015, Eqn. (1.1)) corresponds to the last four equality constraints in (18). Further, the convex set $X$ corresponds to the convex LMIs and scalar inequality constraints in (18), as well as the convex linear equality $\sum_b \alpha^j_b = 1$ for $j = 1, \ldots, P$. The set $Y$ in (He and Yuan, 2015, Eqn. (1.1)) is the entire vector space $\mathbb{R}^{Q\times 1}$ in which the $Z_{j,j'}$ vectors lie, which by definition is convex. Finally, $\theta_1(x)$ in (He and Yuan, 2015, Eqn. (1.1)) corresponds to the convex cost function in (18), and $\theta_2(y) = 0$. Thus, the convergence claims follow as a direct application of (He and Yuan, 2015, Thm. 6.1), since (18) was shown to be a special case of the family of problems considered in (He and Yuan, 2015, Eqn. (1.1)). The exact same line of arguments can be applied for (19) to show that it is a special case of the formulation in (He and Yuan, 2015, Eqn. (1.1)).

References

Hyperspectral remote sensing scenes. Available: http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes, 2021.

Dimitri Bertsekas and John Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 2015.

Pantelis Bouboulis, Symeon Chouvardas, and Sergios Theodoridis. Online distributed learning over networks in RKH spaces using random Fourier features. IEEE Transactions on Signal Processing, 66(7):1920–1932, 2017.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Deng Cai, Xiaofei He, Jiawei Han, and Thomas S. Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2010.

Jiecao Chen, He Sun, David Woodruff, and Qin Zhang. Communication-optimal distributed clustering. Advances in Neural Information Processing Systems, 29, 2016.

Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research, 13(1):795–828, 2012.
Nello Cristianini, John Shawe-Taylor, André Elisseeff, and Jaz Kandola. On kernel-target alignment. Advances in Neural Information Processing Systems, 14, 2001.

Pouya M. Ghari and Yanning Shen. Online multi-kernel learning with graph-structured feedback. In International Conference on Machine Learning, pages 3474–3483. PMLR, 2020.

Michael Grant and Stephen Boyd. CVX: MATLAB software for disciplined convex programming, version 2.1, 2014.

Bingsheng He and Xiaoming Yuan. On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numerische Mathematik, 130(3):567–577, 2015.

Steven C. H. Hoi, Rong Jin, Peilin Zhao, and Tianbao Yang. Online multiple kernel classification. Machine Learning, 90:289–316, 2013.

Songnam Hong and Jeongmin Chae. Distributed online learning with multiple kernels. IEEE Transactions on Neural Networks and Learning Systems, 2021.

Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2012.

Kejun Huang, Nicholas D. Sidiropoulos, and Ananthram Swami. Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition. IEEE Transactions on Signal Processing, 62(1):211–224, 2013.

Rong Jin, Steven C. H. Hoi, and Tianbao Yang. Online multiple kernel learning: Algorithms and mistake bounds. In Algorithmic Learning Theory: 21st International Conference, ALT 2010, Canberra, Australia, October 6-8, 2010. Proceedings 21, pages 390–404. Springer, 2010.

Harold W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.

Shi Li and Xiangyu Guo. Distributed k-clustering for data with heavy noise. Advances in Neural Information Processing Systems, 31, 2018.

Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

Akshay Malhotra and Ioannis D. Schizas. On unsupervised simultaneous kernel learning and data clustering. Pattern Recognition, 108:107518, 2020.

Daniela Micucci, Marco Mobilio, and Paolo Napoletano. UniMiB SHAR: A dataset for human activity recognition using acceleration data from smartphones. Applied Sciences, 7(10):1101, 2017.

Yuichi Motai. Kernel association for classification and prediction: A survey. IEEE Transactions on Neural Networks and Learning Systems, 26(2):208–223, 2014.

Klaus-Robert Müller, Sebastian Mika, Koji Tsuda, and Bernhard Schölkopf. An introduction to kernel-based learning algorithms. In Handbook of Neural Network Signal Processing, pages 4-1. CRC Press, 2018.

Gabriele Oliva, Roberto Setola, and Christoforos N. Hadjicostis. Distributed k-means algorithm. arXiv preprint arXiv:1312.4176, 2013.

Zhengxiao Peng, Yun Li, and Gang Hao. The research on distributed fusion estimation based on machine learning. IEEE Access, 8:38174–38184, 2020.

Jiahu Qin, Weiming Fu, Huijun Gao, and Wei Xing Zheng. Distributed k-means algorithm and fuzzy c-means algorithm for sensor networks based on multiagent consensus theory. IEEE Transactions on Cybernetics, 47(3):772–783, 2016.

Zhenwen Ren and Quansen Sun. Simultaneous global and local graph structure preserving for multiple kernel clustering. IEEE Transactions on Neural Networks and Learning Systems, 32(5):1839–1851, 2020.

Zhenwen Ren, Mithun Mukherjee, Mehdi Bennis, and Jaime Lloret. Multikernel clustering via non-negative matrix factorization tailored graph tensor over distributed networks. IEEE Journal on Selected Areas in Communications, 39(7):1946–1956, 2020.
Ban-Sok Shin, Masahiro Yukawa, Renato Luis Garrido Cavalcante, and Armin Dekorsy. Distributed adaptive learning with multiple kernels in diffusion networks. IEEE Transactions on Signal Processing, 66(21):5505–5519, 2018.

Pham Dinh Tao and L. T. Hoai An. Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.

George Trigeorgis, Konstantinos Bousmalis, Stefanos Zafeiriou, and Björn W. Schuller. A deep matrix factorization method for learning attribute representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):417–429, 2016.

Nikolaos Tsapanos, Anastasios Tefas, Nikolaos Nikolaidis, and Ioannis Pitas. A distributed framework for trimmed kernel k-means clustering. Pattern Recognition, 48(8):2685–2698, 2015.

Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475, 2001.

Lieven Vandenberghe, V. Ragu Balakrishnan, Ragnar Wallin, Anders Hansson, and Tae Roh. Interior-point algorithms for semidefinite programming problems derived from the KYP lemma. Positive Polynomials in Control, pages 195–238, 2005.

Saeed V. Vaseghi. Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons, 2008.

Yu-Xiong Wang and Yu-Jin Zhang. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering, 25(6):1336–1353, 2012.

Penghang Yin, Yifei Lou, Qi He, and Jack Xin. Minimization of ℓ_{1-2} for compressed sensing. SIAM Journal on Scientific Computing, 37(1):A536–A563, 2015.