# decentralized_dictionary_learning_over_timevarying_digraphs__c5b9577f.pdf

Journal of Machine Learning Research 20 (2019) 1-62 Submitted 9/17; Revised 8/19; Published 9/19

Decentralized Dictionary Learning Over Time-Varying Digraphs

Amir Daneshmand adaneshm@purdue.edu Ying Sun sun578@purdue.edu Gesualdo Scutari gscutari@purdue.edu School of Industrial Engineering Purdue University West-Lafayette, IN, USA

Francisco Facchinei facchinei@diag.uniroma1.it Department of Computer, Control, and Management Engineering University of Rome La Sapienza Rome, Italy

Brian M. Sadler brian.m.sadler6.civ@mail.mil U.S. Army Research Laboratory Adelphi, MD, USA

Editor: Sathiya Keerthi

This paper studies Dictionary Learning problems wherein the learning task is distributed over a multi-agent network, modeled as a time-varying directed graph. This formulation is relevant, for instance, in Big Data scenarios where massive amounts of data are collected/stored in diﬀerent locations (e.g., sensors, clouds) and aggregating and/or processing all data in a fusion center might be ineﬃcient or unfeasible, due to resource limitations, communication overheads or privacy issues. We develop a uniﬁed decentralized algorithmic framework for this class of nonconvex problems, which is proved to converge to stationary solutions at a sublinear rate. The new method hinges on Successive Convex Approximation techniques, coupled with a decentralized tracking mechanism aiming at locally estimating the gradient of the smooth part of the sum-utility. To the best of our knowledge, this is the ﬁrst provably convergent decentralized algorithm for Dictionary Learning and, more generally, bi-convex problems over (time-varying) (di)graphs. Keywords: Decentralized algorithms, dictionary learning, directed graph, non-convex optimization, time-varying network

1. Introduction and Motivation

This paper introduces, analyzes, and tests numerically the ﬁrst provably convergent distributed method for a fairly general class of Dictionary Learning (DL) problems. More speciﬁcally, we study the problem of ﬁnding a matrix D RM K (a.k.a. the dictionary), by which the data matrix S RM N can be represented through a matrix X RK N, with a favorable structure on D and X (e.g., sparsity). We target scenarios where computational resources and data are not centrally available, but distributed over a group of I agents, which can communicate through a (possibly) time-varying, directed network; see Fig. 1.

c 2019 Amir Daneshmand, Ying Sun, Gesualdo Scutari, Francisco Facchinei, Brian M. Sadler.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-534.html.

Daneshmand, Sun, Scutari, Facchinei, Sadler

Figure 1: Directed network topology

Each agent i {1, 2, . . . , I} owns one block Si RM ni of the data S [S1, . . . , SI], with PI i=1 ni = N. Partitioning the representation matrix X [X1, . . . , XI] according to S, with Xi RK ni, the class of distributed DL problems we aim at studying reads

min D,{Xi}I i=1 U(D, U)

i=1 fi(D, Xi)

| {z } F(D,X)

i=1 gi(Xi) + G(D)

s.t. D D, Xi Xi, i = 1, . . . , I,

where fi : D Xi R is the ﬁdelity function of agent i, which measures the mismatch between the data Si and the (local) model; this function is assumed to be smooth and biconvex (i.e., convex in D for ﬁxed Xi, and vice versa); G : D R and gi : Xi R are (possibly non-smooth) convex functions, which are generally used to impose extra structure on the solution (e.g., low-rank or sparsity); and D RM K and Xi RK ni are some closed convex sets. To avoid scaling ambiguity in the model, D is assumed to be bounded, without loss of generality. Since all fi s share the common variable D, we call it a shared variable and, by the same token, Xi s are termed private variables. Note that, in this distributed setting, agent i knows only its own functions fi (and gi) but not P j =i fj. Hence, agents aim to cooperatively solve Problem P leveraging local communications with their neighbors. Problem P encompasses several DL-based formulations of practical interest, corresponding to diﬀerent choices of the ﬁdelity functions, regularizers, and feasible sets; examples include the elastic net (Zou and Hastie, 2005) sparse DL, sparse PCA (Shen and Huang, 2008), non-negative matrix factorization and low-rank approximation (Hastie et al., 2015), supervised DL (Mairal et al., 2008), sparse singular value decomposition (Lee et al., 2010), non-negative sparse coding (Hoyer, 2004), principal component pursuit (Cand es et al., 2011), robust non-negative sparse matrix factorization, and discriminative label consistent learning

Decentralized Dictionary Learning Over Time-Varying Digraphs

(Jiang et al., 2011). More details on explicit customizations of the general model P can be found in Sec. 2. Our distributed setting is motivated by several data-intensive applications in several ﬁelds, including signal processing and machine learning, and network systems (such as clouds, cluster computers, networks of sensor vehicles, or autonomous robots) wherein the sheer volume and spatial/temporal disparity of scattered data, energy constraints, and/or privacy issues, render centralized processing and storage infeasible or ineﬃcient. Also, time-varying communications arise, for instance, in mobile wireless networks (e.g., ad-hoc networks), wherein nodes are mobile and/or communicate through fading channels. Moreover, since nodes generally transmit at diﬀerent power and/or communication channels are not symmetric, directed links are a natural assumption. Our goal is to design a provably convergent decentralized method for Problem P, over time-varying and directed graphs. To the best of our knowledge this is an open problem, as documented next.

1.1. Challenges and related works

The design of distributed algorithms for P faces the following challenges:

(i) Problem P is non-convex and non-separable in the optimization variables;

(ii) Each agent i owns exclusively Si and thus can only compute its own function fi;

(iii) Each fi depends on a common set of variables the dictionary D shared among all the agents, as well as the private variables Xi. Shared and private variables need to be treated diﬀerently. In fact, in several applications, the size of private variables is much larger than that of the shared ones; hence, broadcasting agents private variables over the network would result in an unaﬀordable communication overhead;

(iv) The gradient of each fi is in general neither bounded nor globally Lipschitz on the feasible region. This represents a challenge in the design of provably convergent distributed algorithms, as boundedness and Lipschitzianity of the gradient are standard assumptions in the analysis of most distributed schemes for nonconvex problems;

(v) G and gi s are nonsmooth;

(vi) The graph is directed, time-varying; no other structure is assumed (such as star or ring topology, etc.), but some long term connectivity properties (cf. Assumption B).

Centralized methods for the solution of Problem P (or some closely related variants) have been extensively studied and prominent examples are (Aharon et al., 2006; Mairal et al., 2010; Razaviyayn et al., 2014b). However, we are not aware of any distributed algorithm that can address challenges i)-vi) (even some subsets of them), as documented next.

Ad-hoc heuristics: Several attempts have been made to extend centralized approaches for DL problems to a distributed setting (undirected, static graphs), under more or less restrictive assumptions; examples include primal methods (Raja and Bajwa, 2013; Chainais and Richard, 2013; Wai et al., 2015) and (primal/)dual-based ones (Chen et al., 2015; Liang et al., 2014; Chouvardas et al., 2015). While these schemes represent good heuristics, their

Daneshmand, Sun, Scutari, Facchinei, Sadler

theoretical convergence remains an open question, and numerical results are contradictory. For instance, some schemes are shown not to converge while some others fail to reach asymptotic agreement among the local copies of the dictionary; see, e.g. (Chainais and Richard, 2013). Recently and independently from our conference work (Daneshmand et al., 2016), Zhao et al. (2016) proposed a distributed primal-dual-based method for a class of dictionary learning problems related, but diﬀerent from Problem P. Speciﬁcally, they considered: quadratic loss functions fi, with a quadratic regularization on the dictionary (i.e., G = 0), and norm ball constraints on the private variables. The network is modeled as a ﬁxed undirected graph. Asymptotic convergence of the scheme to stationary solutions is proved, but no rate analysis is reported. We remark that the scheme in (Zhao et al., 2016), in order to establish convergence, requires some penalty parameters to go to inﬁnity, which makes the method numerically not attractive.

Distributed nonconvex optimization: Since the DL problem P is an instance of nonconvex optimization problems, we brieﬂy discuss here the few works in the literature on distributed methods for non-convex optimization (Bianchi and Jakubowicz, 2013; Tatarenko and Touri, 2017; Wai et al., 2017; Di Lorenzo and Scutari, 2016; Sun et al., 2016; Hong et al., 2017; Scutari and Sun, 2019); we group these papers as follows. The schemes in (Bianchi and Jakubowicz, 2013; Tatarenko and Touri, 2017; Wai et al., 2017; Hong et al., 2017), while substantially diﬀerent, are all applicable to smooth, unconstrained optimization, with (Bianchi and Jakubowicz, 2013; Wai et al., 2017) handling also compact constraints and (Tatarenko and Touri, 2017) implementable on (time-varying) digraphs. The distributed algorithms in (Di Lorenzo and Scutari, 2016; Sun et al., 2016; Scutari and Sun, 2019) can handle objectives with additive nonsmooth convex functions, with (Sun et al., 2016; Scutari and Sun, 2019) applicable to (time-varying) digraphs. All the above schemes cannot adequately deal with private (i.e., Xi s) and shared variables (i.e., D), which are a key feature of Problem P. Furthermore, convergence therein is proved under the assumption that the gradient of (the smooth part of) the objective function is globally Lipschitz continuous, a property that we do not assume and that is not satisﬁed in many of the applications we consider. The design of provably convergent distributed algorithms for P remains an open problem, let alone rate guarantees.

1.2. Major contributions

In this paper, we propose the ﬁrst provably convergent distributed algorithm for the general class of DL problems P, addressing all challenges i)-vi). The proposed approach uses a general convexiﬁcation-decomposition technique that hinges on recent (centralized) Successive Convex Approximation methods (Scutari et al., 2014; Facchinei et al., 2015). This technique is coupled with a perturbed push-sum consensus scheme preserving the feasibility of the iterates and a tracking mechanism aiming at estimating locally the gradient of P i fi. Both communication and tracking protocols are implementable on time-varying undirected or directed graphs (B-strongly connected). The scheme is proved to converge to stationary solutions of Problem P, under mild assumptions on the step-size employed by the algorithm; a sublinear convergence rate is also established. On the technical side, we contribute to the literature of distributed algorithms for bi-convex (nonsmooth) constrained optimization by

Decentralized Dictionary Learning Over Time-Varying Digraphs

putting forth a new non-trivial convergence analysis that, for the ﬁrst time, i) avoids the assumption that the gradients fi are globally Lipschitz; and ii) deals with private and shared optimization variables. Numerical experiments show that the proposed schemes compare favorably with ad-hoc algorithms, proposed for special instances of Problem P.

1.3. Paper Organization

The rest of the paper is organized as follows. The problem and network setting are introduced in Sec. 2, along with some motivating applications. Sec. 3 presents the algorithm and its convergence properties; the proofs of our results are given in the Appendix, Sec. A. Extensive numerical experiments showing the eﬀectiveness of the proposed scheme are discussed in Sec. 5 whereas Sec. 6 draws some conclusions.

1.4. Notation

Throughout the paper we use the following notation. We denote by Rn + and N+ the nonnegative orthant and the set of non-negative integers, respectively. Given x R, x (resp. x ) denotes the smallest (resp. the largest) integer greater (resp. smaller) than or equal to x. Vectors are denoted by bold lower-case letters (e.g., x) whereas matrices are denoted by bold capital letters (e.g., A). The k-th canonical vector is denoted by ek. The inner product between two real matrices, A and B, is denoted by A, B tr(A B), where tr( ) is the trace operator; A B denotes the Kronecker product. Given the real matrix A, with ij-entries denoted by Aij, we will use the following matrix norms: the

Frobenius norm ||A||F q P i,j |Aij|2; the L1,1 norm ||A||1,1 P i,j |Aij|; the L2, norm

||A||2, maxi q P j A2 ij; the L , norm ||A|| , = maxi,j |Aij|; and the spectral norm

||A||2 σmax(A), where σmax(A) denotes the maximum singular value of A. The matrix quantities Dfi(D, Xi) and Xifi(D, Xi) are the gradients of fi with respect to D and Xi, evaluated at (D, Xi), respectively, with the partial derivatives arranged according to the patterns of D and Xi, respectively. The same convention is adopted for subgradients of gi and G, that are therefore written as matrices of the same dimensions of Xi and D, respectively. Table 1 summarizes the main notation and symbols used in the paper.

Because of the nonconvexity of Problem P, we aim at computing stationary solutions of P, deﬁned as follows: a tuple (D , X ), with X [X 1, . . . , X I] is a stationarity solution of P if the following holds: D D, X i Xi, i = 1, . . . , I, and

D DF(D , X ), D D E + G(D) G(D ) 0, D D,

Xifi(D , X i ), Xi X i + gi(Xi) gi(X i ) 0, Xi Xi, i = 1, . . . , I. (2)

2. Problem Setup and Motivating Examples

In this section, we ﬁrst discuss the assumptions underlying our model and then provide several examples of possible applications.

Daneshmand, Sun, Scutari, Facchinei, Sadler

Symbol Deﬁnition Member of Reference

F(D, X) PI i=1 fi(D, Xi) RM K RK N R (P)

U(D, X) F(D, X) + PI i=1 gi(Xi) + G(D) RM K RK N R (P)

Si Local data matrix RM ni

S [S1, S2, . . . , SI] RM N

D Dictionary matrix variable D RM K

D(i) Local copy of D of agent i D RM K

Dν (i) D(i) at iteration ν D RM K

Dν [Dν (1), Dν (2), . . . , Dν (I)] RM KI (25) e Dν (i) Solution of subproblem (8) D RM K (8)

D ν (1/I) PI i=1 Dν (i) D RM K (25) Uν (i) Local update of dictionary variable D RM K (10)

Xi Local matrix variable Xi RK ni

X [X1, X2, . . . , XI] X RK N

Xν i Xi at iteration ν Xi RK ni

Xν X at iteration ν : [Xν 1, Xν 2, . . . , Xν I ] X RK N (25) Θν (i) Gradient-tracking variable D RM K (13)

Aν (aν ij)I i,j=1 consensus weights at time ν RI I Assumption F

Table 1: Table of notation

2.1. Problem Assumptions

We consider Problem P under the following assumptions.

Assumption A (On Problem P)

(A1) Each fi : O Oi R is C2, lower bounded, and biconvex, where O D and Oi Xi are convex open sets;

(A2) Given D D, each Xifi(D, ) is Lipschitz continuous on Xi, with Lipschitz constant L Xi(D). Furthermore, each L Xi : D R+ is continuous;

(A3) D is compact and convex; and each Xi is closed and convex (not necessarily bounded);

(A4) G : O R is convex (possibly non-smooth);

(A5) For all i = 1, . . . , I, either i) Xi is compact and gi : Oi R is convex; or ii) gi is µi-strongly convex.

The above assumptions are quite mild and are satisﬁed by several problems of practical interest; see Sec. 2.2 for several concrete examples.

Network topology

We study Problem P in the following network setting. Time is slotted and in each time-slot ν the network of the I agents is modeled as a digraph Gν = (V, Eν), where V = {1, . . . , I} is the set of agents and Eν is the set of edges (communication links); we use (i, j) Eν to indicate that there is a directed link from node i to node j. The in-neighborhood of agent

Decentralized Dictionary Learning Over Time-Varying Digraphs

Figure 2: Illustration of in-neighborhood set of agent i at time ν.

i V at time ν is deﬁned as N in i [ν] = {j V|(j, i) Eν} {i} (see Fig. 2) whereas its out-neighborhood is N out i [ν] = {j V| (i, j) Eν} {i}. In words, agent i can receive information from its in-neighborhood members, and send information to its out-neighbors. The out-degree of agent i is deﬁned as dν i N out i [ν] , where | | denotes the cardinality of a set. If the graph is undirected, the set of in-neighbors and out-neighbors coincide; in such a case we just write Ni to denote the set of neighbors of agent i. When the network is static, all the above quantities do not depend on the iteration index ν; hence, we will drop the superscript ν . To let information propagate over the network, we assume that the sequence {Gν}ν possesses some long-term connectivity property, as stated next.

Assumption B (B-strong connectivity) The graph sequence {Gν}ν is B-strongly connected, i.e., there exists an (arbitrarily large) integer B > 0 (unknown to the agents) such that the graph with edge set (k+1)B 1 t=k B Et is strongly connected, for all k 0.

Notice that this condition is quite mild and widely used in the literature to analyze convergence of distributed algorithms over time-varying networks. Generally speaking, it permits strong connectivity to occur over time windows of length B, so that information can propagate from every node to every other node in the network. Assumption B is satisﬁed in several practical scenarios. For instance, commonly used settings in cloud computing infrastructures are star, ring, tree, hypercube, or n-dimensional mesh (Torus) topologies, which all satisfy Assumption B. It is worth mentioning that the multi-hop network topologies of these structures are migrating towards high-radix mesh and Torus, since they are scalable, lowenergy consuming, and much cheaper than other topologies, like fat-tree topologies (Kim, 2008). These type of connected networks are generally time-invariant and undirected, and clearly they satisfy Assumption B.

2.2. Motivating examples

We conclude this section discussing some practical instances of Problem P, all satisfying Assumption A, which show the generality of the proposed model.

Elastic net sparse DL (Tosic and Frossard, 2011; Zou and Hastie, 2005)

Sparse approximation of a signal with an adaptive dictionary is one of the most studied DL problems (Tosic and Frossard, 2011). When an elastic net sparsity-inducing regularizer is

Daneshmand, Sun, Scutari, Facchinei, Sadler

used (Zou and Hastie, 2005), the problem can be written as

min D,{Xi}I i=1

2 Si DXi 2 F + λ Xi 1,1 + µ

s.t. D D, Xi RK ni, i = 1, 2, . . . , I,

where D {D : ||Dek||2 α, k = 1, 2, . . . , K}, and α, λ, µ > 0 are the tuning parameters. Problem (3) is an instance of P, with fi(D, Xi) = (1/2) Si DXi 2 F , gi(Xi) = λ Xi 1,1+ µ

2 Xi 2 F , G(D) = 0, and Xi = RK ni. It is not diﬃcult to check that (3) satisﬁes Assumption A, and the Lipschitz constant in A2 is given by L Xi(D) = (σmax(D))2.

Supervised DL (Mairal et al., 2008)

Consider a classiﬁcation problem with training set {sn, yn}N n=1, where sn is the feature vector with associated binary label yn. The discriminative DL problem aims at simultaneously learning a dictionary D(1) RM K such that sn = D(1)xn, for some sparse xn RK, and ﬁnding a bilinear classiﬁer ζn(D(2), xn, sn) s n D(2) xn that best separates the coded data with distinct labels (Mairal et al., 2008). Assume that each agent i owns {(sn, yn) : n Si}, with {Si}I i=1 being a partition of {1, . . . , N}, then the discriminative DL reads

min D(1),D(2)

ℓ ynζn D(2), xn, sn + 1

sn D(1)xn 2

s.t. D(1) D(1), D(2) D(2), xn RK, n = 1, 2, . . . , N,

where ℓ(x) log(1 + e x) is the logistic loss function; and gn (xn) λ xn 1 + (µ/2) xn 2 2 is the elastic net regularizer. The dictionary D(1) and classiﬁer parameter D(2) are constrained to belong to the convex compact sets D(1) and D(2), respectively. Problem (4) is an instance of Problem P, with D D(1), D(2) , Si [sn]n Si and Xi [xn]n Si. Note that Assumption A is satisﬁed, and the Lipschitz constant in A2 is given by L Xi(D) = (1/4) yi s i D(2) 2 2 + (σmax(D(1)))2.

DL for low-rank plus sparse representation (Bouwmans et al., 2017)

The low-rank plus sparse decomposition problems cover many applications in signal processing and machine learning (Bouwmans et al., 2017), including matrix completion, image denoising, deblurring, superresolution, and Principal Component Pursuit (PCP) (Cand es et al., 2011). Consider the bi-linear model S L + HQU: the data matrix S is decomposed as the superposition of a low-rank matrix L (capturing the correlations among data) and HQU, where Q RM K is an over-complete dictionary (capturing the representative modes of the data), U R K N is a sparse matrix (representing the data parsimoniously), and H is a given degradation matrix, which accounts for tasks such as denoising, superresolution, and deblurring. To enforce L to be low-rank, we employ the nuclear norm L regularizer, which can be equivalently rewritten as L = inf{1

2 P 2 F + 1

2 V 2 F : L = PV}, where P RM L, V RL N, and L min(M, N) (Srebro and Shraibman, 2005; Recht

Decentralized Dictionary Learning Over Time-Varying Digraphs

et al., 2010). Partitioning V and U according to S, i.e., Si = PVi + HQUi, the problem reads

min P,Q,(Vi,Ui)I i=1

Si [P HQ] Vi Ui

P 2 F + I Vi 2 F + λ Xi 1,1 + µ

s.t. D D, Xi R(L+ K) ni, i = 1, 2, . . . , I,

where D is some compact set; ζ > 0 is a constant used to promote the low-rank structure on L while sparsity on X is enforced by the elastic net regularization, with constants λ, µ > 0. Problem (5) is clearly an instance of Problem P wherein fi is the quadratic loss, and [P, Q] and [V i , U i ] are the shared and private variables (K = L+ K), respectively. Assumption A is satisﬁed, and the Lipschitz constant in A2 is given by L Xi(D) = (σmax(PHQ))2. A variant of this problem, which still is a particular case of Problem (5), is obtained by replacing the quadratic loss function with the smoothed Huber function to achieve robustness against outliers (Aravkin et al., 2014).

Sparse SVD/PCA (Lee et al., 2010; Udell et al., 2016; Mairal et al., 2010)

Computing the SVD of a set of data with sparse singular vectors (Sparse SVD) is the foundation of many applications in multivariate analysis, e.g., biclustering (Lee et al., 2010). As proposed in (Mairal et al., 2010), Problem P can be used to accomplish this task by imposing sparsity on the factors D and X of S. More speciﬁcally, we have

min D,(Xi)I i=1

2 Si DXi 2 F + λX Xi 1,1 + µX

+ λD D 1,1 + µD

s.t. D D {D RM K : ||D||2, α}, Xi RK ni, i = 1, 2, . . . , I, (6) where λD, λX, µD, µX, α > 0 are given constants. Problem (6) is an instance of P, with fi(D, Xi) = (1/2) ||Si DXi||2 F ; G(D) = λD ||D||1,1 + (µD/2) ||D||2 F , and gi(Xi) = λX ||Xi||1,1 + (µX/2) ||Xi||2 F . Note that orthonormality of factors are relaxed for sake of simplicity. A related formulation, termed Sparse PCA, has also been used in (Udell et al., 2016). It is not diﬃcult to show that Assumption A is satisﬁed, and the Lipschitz constant in A2 is given by L Xi(D) = (σmax(D))2.

Non-negative Sparse Coding (NNSC) (Hoyer, 2004)

Non-negative Matrix Factorization (NMF) was primarily proposed by (Lee and Seung, 1999) as a better alternative to the classic SVD in learning localized features of image datasets, such as face images. The formulation enforces non-negativity of the entries of D and X. This has been shown to empirically lead to sparse solutions; however no explicit control on sparsity is employed in the model. To overcome this shortcoming, (Hoyer, 2004) proposed a non-negative sparse coding (NNSC) formulation which extends NMF by adding a sparsity-

Daneshmand, Sun, Scutari, Facchinei, Sadler

inducing penalty function of X. The problem reads

min D,(Xi)I i=1

2 Si DXi 2 F + λ Xi 1,1 + µ

s.t. D D {D RM K + | ||D||2, α}, Xi RK ni + , i = 1, 2, . . . , I,

for some λ, µ, α > 0. Problem (7) is another instance of P, with fi(D, Xi) = (1/2) ||Si DXi||2 F , gi(Xi) = λ Xi 1,1 + (µ/2) Xi 2 F , G(D) = 0, D = {D RM K + | ||D||2, α}, and Xi = RK ni + . Assumption A is satisﬁed, and the Lipschitz constant in A2 is given by L Xi(D) = (σmax(D))2.

3. Algorithmic Design

We introduce now our algorithmic framework. To shed light on the core idea behind the proposed scheme, we begin introducing an informal and constructive description of the algorithm, followed by its formal description along with its convergence properties. Each agent i controls its private variable Xi and maintains a local copy of the shared variables D, denoted by D(i), along with an auxiliary variable Θ(i); we anticipate that Θ(i) aims at locally estimating the gradient sum P j Dfj(D(i), Xj), an information that is not available at agent i s side. The value of these variables at iteration ν is denoted by Xν i , Dν (i), and Θν (i), respectively. Roughly speaking, the update of these variables is designed so that asymptotically i) all the D(i) will be consensual, i.e., D(i) = D(j), i = j; and ii) the tuples (D(i), (Xj)I j=1) will be a stationary solutions of Problem P. This is accomplished throughout the following two steps, which are performed iteratively and in parallel across the agents.

Step 1: Local Optimization

The nonconvexity of fi together with the lack of knowledge of P j =i fj in F prevents agent i to solve directly Problem P with respect to (D(i), Xi). Since fi is bi-convex in (D(i), Xi), a natural approach is then to update D(i) and Xi in an alternating fashion by solving a local approximation of P. Speciﬁcally, at iteration ν, given the iterates Xν i , Dν (i), and Θν (i), agent i ﬁxes Xi = Xν i and solves the following strongly convex problem in D(i) :

e Dν (i) argmin D(i) D fi D(i); Dν (i), Xν i + D I Θν (i) Dfi(Dν (i), Xν i ), D(i) Dν (i) E +G D(i) , (8)

where fi( ; Dν (i), Xν i ) is a suitably chosen strongly convex approximation of fi( , Xν i ) at (Dν (i), Xν i ) (cf. Assumption C, Sec. 3.1); and Θν (i), as anticipated, is used to track the

gradient of F, with limν I Θν (i) PI j=1 Dfj(Dν (i), Xν j ) = 0; which would lead to

I Θν (i) Dfi(Dν (i), Xν i ) X

j =i Dfj(Dν (i), Xν j )

This sheds light on the role of the linear term in (8): it can be regarded as a proxy of the sum-gradient P j =i Dfj(Dν (i), Xν j ), which is not available at agent i s side. In Step 2 below we show how to update Θν (i) using only local information, so that (9) holds.

Decentralized Dictionary Learning Over Time-Varying Digraphs

Given e Dν (i), a step-size is employed in the update of D(i), generating the iterate Uν (i):

Uν (i) = Dν (i) + γν( e Dν (i) Dν (i)), (10)

where γν is the step-size, to be properly chosen (see Assumption E, Sec. 3.1). Let us now consider the update of the private variables Xi. Fixing D(i) = Uν (i), agent

i computes the new update Xν+1 i by solving the following strongly convex optimization problem: Xν+1 i argmin Xi Xi hi(Xi; Uν (i), Xν i ) + gi(Xi), (11)

where hi( ; Uν (i), Xν i ) is a strongly convex function of Xi, approximating fi(Uν (i), ) at

(Uν (i), Xν i ); see Assumption C (cf. Sec. 3.1) for speciﬁc instances of hi.

Step 2: Local Communications

Let us design now a local communication mechanism ensuring asymptotic consensus over the local copies D(i) s and property (9). To do so, we build on the (perturbed) push-sum protocol proposed in (Sun et al., 2016) (see also Kempe et al. (2003)). Speciﬁcally, an extra scalar variable φi is introduced at each agent s side to deal with the directed nature of the graph; given φν i and Uν (j) from its in-neighbors j Ni, each agent i updates its own local estimate Dν (i) and φν i according to:

j N in i [ν] aν ij φν j and Dν+1 (i) = 1 φν+1 i

j N in i [ν] aν ij φν j Uν (j), (12)

where aν ij s are some weights (to be properly chosen, see Assumption F, Sec. 3.1); and φ0 i = 1, for all i = 1, . . . , I. Note that the updates in (12) can be implemented locally: all agents only need to (i) send their local variable Uν (j) and the scalar weight aν ij φν j to their neighbors; and (ii) collect locally the information coming from the neighbors. To update the Θν (i) variables we leverage the gradient tracking mechanism ﬁrst introduced in (Di Lorenzo and Scutari, 2016), coupled with the push-sum consensus scheme (Sun et al., 2016), resulting in the following perturbed push-sum scheme:

Θν+1 (i) = 1 φν+1 i

j N in i [ν] aν ijφν j Θν (j) + 1 φν+1 i

Dfi(Dν+1 (i) , Xν+1 i ) Dfi(Dν (i), Xν i ) , (13)

with Θ0 (i) Dfi(D0 (i), X0 i ), for all i = 1, . . . , I. The update (13) follows similar logic as that of Dν (i) in (12), with the diﬀerence that (13) contains a perturbation [the second term in the RHS of (13)], which employs Θν (i) and ensures the desired tracking properties (otherwise Θν (i) would converge to the average of their initial values). Note that (13) can be performed locally by agent i, following the same procedure as described for (12). Combining the above steps, we can now formally introduce the proposed distributed algorithm for the DL problems P, as described in Algorithm 1, and termed D4L (Decentralized Dictionary Learning over Dynamic Digraphs) Algorithm.

Daneshmand, Sun, Scutari, Facchinei, Sadler

Algorithm 1 : Decentralized Dictionary Learning over Dynamic Digraphs (D4L)

Initialization : set ν = 0 and φ0 i = 1, D0 (i) D, X0 i Xi, Θ0 (i) = Dfi(D0 (i), X0 i ), for all i = 1, 2, . . . , I. S1. If (Dν (i), Xν i ) satisﬁes a suitable stopping criterion: STOP;

S2. Local Optimization: Each agent i computes:

(a) e Dν (i) and Uν (i) according to (8) and (10);

(b) Xν+1 i according to (11);

S3. Local Communications: Each agent i collects data from its current neighbors and updates:

(a) φν+1 i and Dν+1 (i) according to (12);

(b) Θν+1 (i) according to (13);

S4. Set ν + 1 ν, and go to S1.

3.1. Algorithmic Assumptions

Before stating the main convergence result for the D4L Algorithm, we discuss the main assumptions governing the choices of the free parameters of the algorithm, namely: the surrogate functions fi and hi, the step-size γν, and the consensus weights (aν ij)I i,j=1.

3.1.1. On the choice of fi and hi.

The surrogate functions are chosen to satisfy the following assumption.

Assumption C (On fi and hi) Given Dν (i) and Xν i , fi( ; Dν (i), Xν i ) in (8) is either

fi(D(i); Dν (i), Xν i ) = fi(D(i), Xν i ) + τ ν D,i

2 ||D(i) Dν (i)||2 F , (14)

or fi(D(i); Dν (i), Xν i ) = D Dfi(Dν (i), Xν i ), D(i) Dν (i) E + τ ν D,i

D(i) Dν (i) 2 F , (15)

where τ ν D,i is a positive scalar satisfying Assumption D. Given Uν (i) and Xν i , hi( ; Uν (i), Xν i ) in (11) is either

hi(Xi; Uν (i), Xν i ) fi(Uν (i), Xi) + τ ν X,i

2 Xi Xν i 2 F , (16)

or hi(Xi; Uν (i), Xν i ) = D Xifi(Uν (i), Xν i ), Xi Xν i E + τ ν X,i

Xi Xν i 2 F , (17)

where τ ν X,i is a positive scalar satisfying Assumption D.

Decentralized Dictionary Learning Over Time-Varying Digraphs

Assumption D (On τ ν X,i and τ ν D,i) The parameters (τ ν X,i)I i=1 and (τ ν D,i)I i=1 are chosen such that (D1) {τ ν D,i}ν and {τ ν X,i}ν satisfy 0 < inf ν τ ν D,i sup ν τ ν D,i < + , (18)

and supν τ ν X,i < + ,

2L Xi(Uν (i)) + ϵ, ν 1, (19)

for all i = 1, 2, . . . , I, where ϵ > 0 is an arbitrarily small constant, and L Xi is deﬁned in Assumption A2.

(D2) Stronger convergence results [cf. Theorem 2] can be obtained if, under Assumption A5(ii), the sequences {τ ν D,i}ν and {τ ν X,i}ν, in addition to D1, also satisfy

τ t+1 D,i τ t D,i < , (20)

and lim sup ν

τ ν X,i τ ν 1 X,i < µ, (21)

where µ mini µi and µi is the strongly convexity constant of fi.

Discussion. Several comments are in order. On the choice of fi and hi. Since fi (resp. hi) is convex in D(i) (resp. Xi), (14) [resp. (16)] is a natural choice for the surrogate fi (resp. hi): the structure of fi (resp. hi) is preserved while a quadratic term is added to make the overall surrogate strongly convex. The non-smooth strongly convex subproblems (8) and (11) resulting from (14) and (16) can be solved using standard solvers, e.g., projected subgradient methods. When dealing with large-scale instances, eﬀective methods are also (Facchinei et al., 2015; Daneshmand et al., 2015). The alternative surrogates fi and hi as given in (15) and (17), respectively, are based on the the linearization of the original fi and hi. This option is motivated by the fact that, for speciﬁc instances of fi and hi, (15) and (17) lead to subproblems (8) and (11) whose solution can be computed in closed form. For instance, consider the elastic net sparse DL problem (3) in Sec. 2.2, where fi(D, Xi) = 1

2||Si DXi||2 F ; G(D) = 0; and gi(Xi) = λ ||Xi||1 + µ

2 ||Xi||2 F , with λ, µ > 0. By using (15), the resulting subproblem (8) admits the following closed form solution:

e Dν (i) = PD

Dν (i) I τ ν D,i Θν (i)

Referring to the sparse coding subproblem (11), if hi is chosen according to (16), computing the update Xν+1 i results in solving a LASSO problem. If instead one uses the surrogate in (17), the solution of (11) can be computed in closed form as

Xν+1 i = τ ν X,i µ + τ ν X,i T λ τν X,i

Xν i 1 τ ν X,i Xifi(Uν (i), Xν i )

Daneshmand, Sun, Scutari, Facchinei, Sadler

where T is the soft-thresholding operator Tθ(x) max(|x| θ, 0) sign(x) [with sign( ) denoting the sign function], applied to the matrix argument component-wise.

On the choice of τ ν X,i and τ ν D,i. These coeﬃcients must satisfy Assumption D. Roughly speaking, D1 ensures that (τ ν X,i)I i=1 and (τ ν D,i)I i=1 are bounded (both from below and above) while D2 guarantees that these parameters are asymptotically stable . A trivial choice for τ ν D,i satisfying both (18) and (20) is τ ν D,i = c, for some c > 0; some practical rules for τ ν X,i satisfying both (19) and (21) are the following:

(a) Use a constant τ ν X,i, that is,

τ ν X,i = max D D

max σmax 2 Xifi(D, Xν i ) , ϵ ,

for some ϵ > 0. The above value can be, however, much larger than any σmax( 2 Xifi(Uν (i), Xν i )), which can slow down the practical convergence of the algorithm;

(b) A less conservative choice is to satisfy (19) iteratively, while guaranteeing that τ ν X,i is uniformly positive: τ ν X,i = max(L Xi(Uν (i)), ϵ), (24)

where ϵ is any positive (possibly small) constant;

(c) A generalization of (b) is

τ ν X,i h max(L Xi(Uν (i)), ϵ), L Xi(Uν (i)) + µ i ,

for some ϵ and µ such that 0 < ϵ µ < µ.

Remark 1 While the choices (a)-(c) above clearly satisfy (19), it can be shown that (21) also holds, as a consequence of the continuity of L Xi( ) and Proposition 5 (cf. Appendix A.4).

Note that all the above rules do not require any coordination among the agents, but are implementable in a fully distributed manner, using only local information.

3.1.2. On the choice of γν

The step-size can be chosen according to the following assumption.

Assumption E (On γν) {γν}ν satisﬁes: γν (0, 1], for all ν; P ν=0 γν = ; and P ν=0 (γν)2 < .

The above assumption is the standard diminishing-rule; see, e.g., (Bertsekas and Tsitsiklis, 1997). Here, we only recall one rule, satisfying Assumption E, that we found very eﬀective in our experiments, namely (Facchinei et al., 2015): γν = γν 1(1 ϵ0γν 1) with γ0 (0, 1] and ϵ0 (0, 1/γ0).

Decentralized Dictionary Learning Over Time-Varying Digraphs

3.1.3. On the choice of the weigh coefficients {aν ij}.

We denote by Aν the matrix whose entries are the weights aν ij s, i.e., [Aν]i,j = aν ij. This matrix is chosen so that the following conditions are satisﬁed.

Assumption F (On the weighting matrix) Given the digraph Gν = (V, Eν), each matrix Aν, with [Aν]ij = aν ij, satisﬁes

(F1) aν ii κ > 0 for all i = 1, . . . , I;

(F2) aν ij κ > 0, if (j, i) Eν; and aν ij = 0 otherwise;

(F3) Aν is column stochastic, i.e., 1 Aν = 1 .

When the graph Gν is directed, a valid choice of Aν is (Kempe et al., 2003): aν ij = 1/dν j if j N in i [ν], and aν ij = 0 otherwise, where dν j is the out-degree of agent j at time ν. The resulting communication protocols (12) (13) can be easily implemented in a distributed fashion:each agent i) broadcasts its local variable normalized by its current out-degree; and ii) collects locally the information coming from its neighbors. When the graph is undirected, several options are available in the literature, including: the Laplacian, Metropolis-Hastings, and maximum-degree weights; see, e.g., (Xiao et al., 2005).

4. Convergence of D4L

In this section, we provide the main convergence results for the D4L Algorithm. We begin introducing some deﬁnitions, instrumental to state our results. Let

Dν [Dν (1), Dν (2), . . . , Dν (I)] , Xν [Xν 1, Xν 2, . . . , Xν I], and D ν 1

i=1 Dν (i). (25)

Given the sequence {(Dν, Xν)}ν generated by the D4L Algorithm, convergence is stated measuring the distance of the sequence {(D ν, Xν}ν from optimality as well as the consensus disagreement among the local variables Dν (i) s. Distance from stationarity is measured by

ν max( D(D ν, Xν), X(D ν, Xν)) (26)

D(D ν, Xν) || b D(D ν, Xν) D ν|| , , X(D ν, Xν) || b X(D ν, Xν) Xν|| , , (27)

with the functions b D( , ) and b X( , ) deﬁned as:

b D(D ν, Xν) argmin D D

DF(D ν, Xν), D D ν + ˆτD

D D ν 2 + G(D ), (28)

b X(D ν, Xν) [ b X1(D ν, Xν 1), . . . , b XI(D ν, Xν I)],

with b Xi(D ν, Xν i ) argmin X i Xi

Xifi(D ν, Xν i ), X i Xν i + ˆτX

X i Xν i 2 + gi(X i),

Daneshmand, Sun, Scutari, Facchinei, Sadler

for some given constants ˆτD > 0 and ˆτX > 0. Note that D( , ) and X( , ) are valid merit functions, in the sense that they are continuous and D(D , X ) = X(D , X ) = 0 if and only if (D , X ) is a stationary solution of Problem (P) (Facchinei et al., 2015). The consensus error at iteration ν is measured by the function

eν ||Dν 1 D ν|| , . (30)

Asymptotic convergence of D4L to stationary solutions of (P) is stated in Theorem 2 below while the convergence rate is studied in Theorem 3.

Theorem 2 Given Problem P under Assumption A, let Dν, Xν

ν be the sequence generated by the D4L Algorithm for a given initial point D0, X0 and under Assumptions B, C, D1, E, F. Then,

(a) [Consensus]: All Dν (i) s are asymptotically consensual, i.e., limν eν = 0;

(b) [Convergence]: i) {(D ν, Xν)}ν is bounded; ii) {U(D ν, Xν)}ν converges to a ﬁnite value; iii) limν X(D ν, Xν) = 0; and iv) lim infν D(D ν, Xν) = 0. Therefore, {(D ν, Xν)}ν has at least one limit point which is a stationary solution of P.

If, in particular, Assumption A5(ii) holds and D1 is reinforced by D2, then convergence in (b) can be strengthened as follows:

(b ) Case (b) holds and limν ν = 0, implying that all the limit points of {(D ν, Xν)}ν are stationary solutions of P.

Proof The proof is quite involved and is given in Appendix A.2.

The above theorem states two main convergence results under Assumptions B, C, D1, E, F: i) existence of at least a subsequence of (D ν, Xν) converging to a stationary solution of Problem P; and ii) asymptotic consensus of all Dν (i) to a common value D ν. If Assumption A5(ii) is also assumed and D1 is reinforced by D2 the stronger results in (b ) can be proven, showing that every limit point is a stationary solution. Note that from a practical point of view the weaker result guaranteeing existence of at least a subsequence converging to a stationary solution is perfectly satisfying, since it guarantees that the algorithm can be terminated after a ﬁnite number of iterations with an approximate solution. Theorem 3 Consider either settings of Theorem 2, with the additional assumption that the step-size sequence {γν}ν is non-increasing. For any given ϵ > 0, let TD,ϵ min{ν N+ : D(D ν, Xν) ϵ} and TX,ϵ min{ν N+ : X(D ν, Xν) ϵ}. Then,

(a) [Rate of consensus error]:

eν = O γ θν , (31) for every θ (0, 1);

(b) [Rate of optimization errors]:

Let γν = K/νp with some constant K > 0 and p (1/2, 1). Then,

TD,ϵ = O 1 ϵ2/(1 p)

Decentralized Dictionary Learning Over Time-Varying Digraphs

Figure 3: Topology of a simulated (sparse) network

Proof See Appendix A.3

We remark that, while a convergence rate has been established in the literature (see, e.g., Razaviyayn et al. (2014a)) for certain centralized algorithms applied to special classes of DL problems, Theorem 3 represents the ﬁrst rate result for a distributed algorithm tackling the class of DL problems P.

5. Numerical Experiments

In this section, we test numerically our algorithmic framework on several classes of problems, namely: (i) Image denoising, (ii) Biclustering, (iii) Sparse PCA, and (iv) Non-negative sparse coding. We recall that D4L is the ﬁrst provably convergent distributed algorithm for Problem P; comparisons are thus not simple. To give the sense of the performance of D4L, in our experiments, (i) when available, we implemented, centralized algorithms tailored to the speciﬁc problems under consideration and used the results as benchmarks; (ii) for undirected graphs, we extended the (distributed) Prox-PDA-IP (Zhao et al., 2016) algorithm to the simulated instances of Problem P (generalizations of this method to directed graphs seem not possible); (iii) for both undirected and directed graphs, we implemented a suitable version of the Adapt-Then-Combine (ATC) Algorithm (Chainais and Richard, 2013). Note that ATC has no formal convergence proof, and is originally designed to handle only undirected graphs, but we managed to make a sensible extension of this method to directed graphs too, by using some of the ideas developed in this paper. All codes are written in MATLAB 2016b, and implemented on a computer with Intel Xeon (E5-1607 v3) quad-core 3.10GHz processor and 16.0 GB of DDR4 main memory.

5.1. Image Denoising

Problem formulations: We consider denoising a 512 512 pixels image of a ﬁshing boat (USC, 1997) see Fig. 5(a). We simulate a cluster computer network composed of 150 nodes (computers). Denoting by F0 and F the noise-free and corrupted image, respectively, the SNR (in d B) is deﬁned as SNR 20 log(||vec(F0)||2/

MSE) while the Peak SNR (in d B) is

Daneshmand, Sun, Scutari, Facchinei, Sadler

deﬁned as PSNR 20 log(maxi(vec(F0))i/

MSE), where MSE is the Mean-Squared-Error between F0 and F. The ﬁshing boat image is corrupted by additive white Gaussian noise, so that SNR = 15 d B and PSNR = 20.34 d B. To perform the denoising task, we consider the elastic net sparse DL formulation (3). We extract 255,150 square sliding s s pixel patches (s = 8) and aggregate the vectorized extracted patches in a single data matrix S of size 64 255, 150. The size of the dictionary is s2 s2 = 64 64; the data matrix is equally distributed across the 150 nodes, resulting in sparse representation matrices Xi of size 64 1701 (K = 64 and ni = 1701). The total number of optimization variables is then 16, 333, 696. The free parameters λ, µ and α in (3) are set to λ = 1/s, µ = λ and α = 1, respectively.

Algorithms and tuning: We tested: i) two instances of the D4L Algorithm, corresponding to two alternative choices of the surrogate functions; ii) the Prox-PDA-IP algorithm (Zhao et al., 2016), adapted to problem (3) (only on undirected networks); iii) the ATC algorithm (Chainais and Richard, 2013); and iv) the centralized K-SVD algorithm (Elad and Aharon, 2006) (KSVD-Box v13 package), used as a benchmark. More speciﬁcally, the two instances of the D4L Algorithm are:

Plain D4L: hi is chosen as in (16) (the original function) and fi as in (15);

Linearized D4L: hi is given by (17) (ﬁrst-order approximation) and fi is given by (15).

The rest of the parameters in both instances of D4L is set as: γν = γν 1(1 ϵγν 1), with γ0 = 0.5 and ϵ = 10 2; τ ν D,i = 10; and τ ν X,i = max(L Xi(Uν (i)), 1) [cf. (24)]. Our adaptation of the Prox-PDA-IP algorithm to Problem (3) is summarized in Algorithm 2. The diﬀerence with the original version in (Zhao et al., 2016) are: i) the elastic net penalty is used in the objective function for the Xi s variables, instead of the ℓ1-norm and ℓ2-norm ball constraints; and ii) the variables D(i) s are constrained in D {D : ||Dek||2 α, k = 1, 2, . . . , K} rather than using the ℓ2-norm regularization in the objective function. The other symbols used in Algorithm 2 are: i) the incidence matrix of G, denoted by M = (Mei)e,i RE I, with E |E|; ii) the matrices Ων e RM K, e = 1, ..., E, which are the ν-th iterate of the dual matrix variables Ωe RM K, as introduced in the original Prox-PDA-IP; and iii) {βν}ν N+ is the increasing penalty parameter, set to βν = 0.002ν. All the algorithms are initialized to the same value: D0 (i) s coincide with randomly (uniformly) chosen columns of S(i) s whereas all X0 i s are set to zero. While the subproblems solved at each iteration ν in Linearized D4L admit a closedform see (23) and (22) in both Plain D4L and ATC, the update of the dictionary has the closed form expression (22), but the update of the private variables calls for the solution of a LASSO problem (cf. Sec. 3.1). For both Plain D4L and ATC, the LASSO subproblems at iteration ν are solved using the (sub)gradient algorithm, with the following tuning. A diminishing step-size is used, set to γr = γr 1(1 ϵγr 1), where γ0 = 0.9, ϵ = 10 3, and r denotes the inner iteration index. A warm start is used for the subgradient algorithm: the initial points are set to Xν i , where ν is the iteration index of the outer loop. We terminate

Decentralized Dictionary Learning Over Time-Varying Digraphs

Algorithm 2 : Prox-PDA-IP algorithm (Zhao et al., 2016)

Initialization : D0 (i) D, X0 i Xi, Ω0 = 0; S1. If (Dν (i), Xν i )i satisﬁes stopping criterion: STOP;

S2. Each agent i computes θν i = ||Dν (i)Xν i Si||2 F and:

(a) Xν+1 i = argmin Xi RK ni fi(Dν (i), Xi)+gi(Xi)+ βν+1θν i 2 ||Xi Xν i ||2 F + βν+1

2 ||Dν (i)(Xi Xν i )||2 F ;

(b) Dν+1 (i) = argmin D(i) D fi(D(i), Xν+1 i ) + PE e=1 Mei Ων e, D(i)

+βν+1 di||D(i)||2 F D(i), (di 1)Dν (i) + P j Ni Dν (j) ;

(c) Ων+1 e = Ων e + βν+1 PI i=1 Mei Dν+1 (i) , e = (i, j) E;

S3. Set ν + 1 ν, and go to S1.

the subgradient algorithm in the inner loop when Jr i 10 6, with

Jr i Xν,r i s 1 + s T 1

Xν,r i Xifi(Uν (i), Xν,r i ) + τ ν X,i (Xν,r i Xν i ) , ,

where Xν,r i denotes the value of Xi at the r-th inner iteration and outer iteration ν; and Tθ(x) max(|x| θ, 0) sign(x) is the soft-thresholding operator, applied to the matrix argument componentwise. In all our simulations, we observed that the above accuracy was reached within 30 (inner) iterations of the subgradient algorithm. In the Prox-PDA-IP scheme, Step S2 (cf. Algorithm 2) calls for the solution of two subproblems, including a LASSO problem. As for Plain D4L and ATC, we used the (projected) (sub)-gradient algorithm (with the same diminishing step-size rule) to solve the subproblems; we terminated the inner loop when the length between two consecutive iterates of the (projected) (sub)-gradient algorithm goes for the ﬁrst time below 10 6. We simulated both undirected and directed static graphs. In the former case, there is no need of the φ-variables and, in the second equation of (12) [and (13)], the terms (φν j aν ij)/φν+1 i reduce to aij. The weights aij are chosen according to the Metropolis-Hasting rule (Xiao et al., 2007); the resulting matrix Aν = [aij]ij is thus time-invariant and doubly stochastic. When the graph is directed, we use the update of the φν i s as in (12), with the weights aν ij chosen according to the push-sum protocol (Kempe et al., 2003) (cf. Sec. 3.1.3).

Convergence speed and quality of the reconstruction: In the ﬁrst set of simulations, we considered an undirected graph composed of 150 nodes, clustered in 6 groups of 25 (see Fig. 3). Starting from this topology, we kept adding random edges till a connected graph was obtained. Speciﬁcally, an arc is added between two nodes in the same cluster (resp. diﬀerent clusters) with probability p1 = 0.2 (resp. p2 = 2 10 3). In Fig. 4 we plot the objective function value [subplot on the left], the consensus disagreement eν as in (30) [subplot in the center], and the distance from stationarity ν as in

Daneshmand, Sun, Scutari, Facchinei, Sadler

0 500 1000 10-2

Figure 4: Denoising problem D4L, Prox-PDA-IP and ATC algorithms: objective value [subplot on the left], consensus disagreement [subplot in the center], and distance from stationarity ν [cf. (26)] [subplot on the right] vs. number of message exchanges.

(28) [subplot on the right] versus the number of message exchanges, achieved by Plain D4L, Linearized D4L, Prox-PDA-IP, and ATC. Note that the number of messages exchanged in the ATC algorithm at iteration ν coincides with ν whereas for Prox-PDA-IP and the D4L schemes is 2ν (recall that the latter schemes employ two steps of communications per iteration). The ﬁgures clearly show that both versions of D4L are much faster than Prox-PDA-IP and ATC (or, equivalently, they require fewer information exchanges). Moreover, ATC does not seem to reach a consensus on the local copies of the dictionary, while Prox-PDA-IP and D4L schemes reach an agreement quite soon. In Fig. 5, we plot the reconstructed images along with their PSNR and MSE, obtained by the algorithms, when terminated after 1000 message exchanges. The ﬁgures clearly show superior performance of D4L over its competitors. Also, the values of PSNR and MSE achieved by D4L are comparable with those obtained by (centralized) K-SVD (KSVD-Box v13 package). A closer look at Fig. 4 shows that a signiﬁcant decay on the objective function occurs in the ﬁrst 200 message exchanges. It is then interesting to check the quality of the reconstructed images, achieved by the algorithms if terminated then. In Fig. 6, we report the images and values of PSNR and MSE obtained by terminating the schemes after 200 message exchanges (we also plot the benchmark obtained by K-SVD, run till optimality). The ﬁgure shows that both versions of D4L attain high quality solutions even if terminated after few message exchanges while ATC and Prox-PDA-IP lag behind. This means that, in practice, there is no need to run D4L till very low values of eν and ν are achieved.

Since the algorithms do not have the same cost-per-iteration, to get further insights into the performance of these schemes, we also compare them in terms of running time. In Table 2, we report the averaged elapsed time to execute one iteration of all algorithms. We considered the same setting as in the previous ﬁgures, but we terminated all algorithms after 273 seconds, which corresponds to the time for the fastest algorithm (i.e. Linearized D4L) to perform 200 message exchanges [cf. Fig. 6]. The associated reconstructed images are shown in Fig. 7. Once again, these results clearly show that the linearized D4L scheme signiﬁcantly outperforms Prox-PDA-IP and ATC. Also, Linearized D4L performs considerably better

Decentralized Dictionary Learning Over Time-Varying Digraphs

Figure 5: Denoising outcome. (a): original image; (b): corrupted image; (c)-(f): denoising achieved by D4L, Prox-PDA-IP and ATC terminated after 1000 message exchanges; and (g): denoising achieved by centralized K-SVD (KSVD-Box v13).

than Plain D4L, when terminated early; the explanation is in Table 2 which shows that the time of one iteration of the former algorithm is much shorter than that of Plain D4L.

Algorithm Average Time per Iteration (sec) Linearized D4L 2.862 Plain D4L 11.328 Prox-PDA-IP 30.98 ATC 9.838

Table 2: D4L vs. Prox-PDA-IP and ATC: Average computation time per iteration

Impact of the graph topology and connectivity: We study now the inﬂuence of the topology and graph connectivity on the performance of the algorithms. We consider directed, static graphs. We generated 5 instances of digraphs, with diﬀerent connectivity, according to the following procedure. There are 500 nodes (I = 500), which are clustered in nc = 50 clusters, each of them containing 10 = I/nc nodes. Each node has an outgoing arc to another node in the same cluster with probability p1 while p2 is the probability of an outgoing arc to a node in a diﬀerent cluster. We chose the values of p1 and p2, as in Table 3; we simulated three scenarios, namely: N1 corresponds to a highly connected network, N3 describes a poorly connected scenario, and N2 is an intermediate case. For

Daneshmand, Sun, Scutari, Facchinei, Sadler

Figure 6: Denoising outcome. (a): original image; (b): corrupted image; (c)-(f): denoising achieved by D4L, Prox-PDA-IP and ATC terminated after 200 message exchanges; and (g): denoising achieved by centralized K-SVD (KSVD-Box v13).

Figure 7: Denoising outcome. (a): Linearized D4L; (b): Plain D4L; (c): Prox-PDA-IP; (d) ATC; all terminated after after 273 seconds run-time (corresponding to 200 message exchanges of the Linearized D4L).

each scenario, we generated 5 random instances (if a generated graph was not strongly connected we discarded it and generated a new one) and then ran Plain and Linearized D4L and ATC on the resulting 15 graphs. Recall that ATC was not designed to work on directed networks. We thus modiﬁed it by using our new consensus protocol (but not the gradient tracking mechanism); we term it Modiﬁed ATC. In Fig. 8 we plot the average value of the objective function [subplot on the left], the consensus disagreement eν [subplot in the center], and the distance from stationarity ν

[subplot on the right], achieved by Plain D4L, Linearized D4L, and Modiﬁed ATC, versus the number of message exchanges, for the three scenarios N1 [subplot (a)], N2 [subplot (b)]

Decentralized Dictionary Learning Over Time-Varying Digraphs

0 500 1000 1

0 500 1000 10-8

0 500 1000 10-4

0 500 1000 1

0 500 1000 10-5

0 500 1000 10-4

0 500 1000 1

0 500 1000 10-5

0 500 1000 10-4

Figure 8: Denoising problem D4L and Modiﬁed ATC algorithms: objective value [subplots on the left], consensus disagreement [subplots in the center], and distance from stationarity ν [cf. (26)] [subplots on the right] vs. number of message exchanges. Comparison over three network settings [cf. Table 3]: N1 [subplots (a)], N2 [subplots (b)], and N3 [subplots (c)].

Daneshmand, Sun, Scutari, Facchinei, Sadler

Network # I nc p1 p2 N1 500 50 0.9 0.9 N2 500 50 0.1 0.01 N3 500 50 0.05 0.01

Table 3: Network setting

and N3 [subplot (c)]. The average is taken over the aforementioned 5 digraph realizations. While also this batch of tests conﬁrms the better behavior of D4L schemes over ATC, it is interesting to observe that there seems to be little inﬂuence of the degree of connectivity on the behavior of Linearized and Plain D4L. The only aspect for which a reasonable inﬂuence can be seen is on consensus. In fact, with respect to consensus, Linearized D4L seems to improve over Plain D4L, when connectivity decreases. This has a natural interpretation. Plain D4L solves much more accurate subproblems at each iteration and this is, in some sense useless, especially in early iterations, when information has not spread across the network. It seems clear that the less connected the graph, the more time information needs to spread. Therefore, in scenario N1, the two methods are almost equivalent and, looking at consensus error, we see that initially Linearized D4L is better than Plain D4L, but soon, as information spreads, Plain D4L becomes, even if slightly, better than Linearized D4L. The same behavior can be observed for scenario N2, but this time the initial advantage of Linearized D4L is larger and the switching point is reached much later. This is consistent with the fact that information needs more time to spread and therefore solving the accurate subproblem is not advantageous. If one passes to N3, where connectivity is very loose, there is no switching point within the ﬁrst 1000 message exchanges.

5.2. Biclustering

Biclustering has been shown to be useful in several applications, including biology, information retrieval, and data mining; see, e.g., (Madeira and Oliveira, 2004).

Problem Formulation: We consider a Biclustering problem in the form (6), applied to genetic information. We solved the problem simulating a networked computer cluster composed of 500 nodes (see Table 3). The genetic data is borrowed from (Lee et al., 2010) (centered and normalized): the data matrix S of size 56 12, 625 (M = 56 and N = 12, 625) contains microarray gene expressions of 56 patients (rows); each patient is either identiﬁed to be normal (Normal) or belonging to one of the following three types of lung cancer: pulmonary carcinoid tumors (Carcinoid), colon metastases (Colon), and small cell carcinoma (Small Cell). We considered the unsupervised instance of the problem, meaning that none of the a-priori information about the type of patients cancer is used to perform biclustering. Following the numerical experiments of (Lee et al., 2010), we seek rank-3 sparse matrices Xi, and the data matrix S is equally distributed across the 500 nodes, resulting thus in K = 3 and ni = 26. The total number of variables is then 39, 168. The other parameters are set as follows: α = 1, λX = µX = 0.1, and λD = µD = 0.1.

Algorithms and tuning: We tested the instance of D4L where fi and hi are chosen according to (14) and (16), respectively. The rationale behind this choice is to exploit the

Decentralized Dictionary Learning Over Time-Varying Digraphs

extra structure of the original function fi, plus, in case of fi, there is no certain beneﬁt in using the linear approximation (15) as it does not lead to any closed form solution of the subproblem (8). Note that the Prox-PDA-IP scheme is not applicable here since the network is directed. The other parameters of the algorithm are set to: γν = γν 1(1 ϵγν 1), with γ0 = 0.2 and ϵ = 10 2; and τ ν D,i = 100 and τ ν X,i = max(L Xi(Uν (i)), 100). We term such an instance of D4L Plain D4L. We compared Plain D4L with the following algorithms: i) (a modiﬁed version of) the distributed ATC algorithm (Chainais and Richard, 2013), where the optimization of D is adjusted to solve (6) (the elastic-net penalty is added), and the consensus mechanism is modiﬁed with our new consensus protocol to handle directed network topologies; we termed this instance Modiﬁed ATC; and ii) the centralized SSVD algorithm proposed in Lee et al. (2010) (implemented using the MATLAB code provided by the authors), to benchmark the results obtained by the distributed algorithms. All the distributed algorithm are initialized setting each X0 i = 0, and each D0 (i) equal to some randomly chosen columns of Si. In D4L, the subproblems (8) and (11) at iteration ν do not have a closed form solution; they are solved using the projected (sub)gradient algorithm, with diminishing step-size γr = γr 1(1 ϵγr 1), where γ0 = 0.9, ϵ = 10 3, and r denotes the inner iteration index. A warm start is used for the projected subgradient algorithm; the initial points are set to Dν (i) and Xν i in problems (8) and (11), respectively, where ν is the iteration index of the outer loop. We terminate the projected subgradient algorithm solving (8) and (11) when Jr D,i b Dν,r (i) Dν,r (i) , 10 6 and Jr X,i b Xν,r i Xν,r i , , 10 6, respectively, where

b Dν,r (i) argmin D(i) D

D Dfi(Dν,r (i) , Xν i ) + IΘν (i) Dfi(Dν (i), Xν i ) + τ ν D,i Dν,r (i) Dν (i) , D(i) Dν,r (i) E

D(i) Dν,r (i)

2 + G D(i) ,

b Xν,r i argmin Xi RK ni

D Xifi(Uν (i), Xν,r i ) + τ ν X,i (Xν,r i Xν i ) , Xi Xν,r i E

2 Xi Xν,r i 2 + gi (Xi) ,

with Dν,r (i) and Xν,r i denoting the value of D(i) and Xi at the ν-th outer and r-th inner iteration, respectively. In all our simulations, the above accuracy was reached within 50 (inner) iterations of the projected subgradient algorithm.

Convergence speed and quality of the reconstruction: We simulated 3 directed static network topologies, namely: N1-N3, as given in Table 3. In Fig. 9 we plot he average value of the objective function [subplot on the left], the consensus disagreement eν [subplot in the center], and the distance from stationarity ν [subplot on the right], achieved by Plain D4L and Modiﬁed ATC, versus the number of message exchanges, for the three scenarios N1 [subplot (a)], N2 [subplot (b)] and N3 [subplot (c)]. The average is taken over 5 digraph realizations. Fig. 9 shows that Plain D4L algorithm attains satisfactory merit values in all network scenarios, while Modiﬁed ATC fails to reach consensus/convergence, even in highly connected networks. The poor performance of Modiﬁed ATC seem mainly due to the incapability of locking the consensus.

Daneshmand, Sun, Scutari, Facchinei, Sadler

0 500 1000 4500

0 500 1000 10-8

0 500 1000 10-4

0 500 1000 4500

0 500 1000 10-3

0 500 1000 10-4

0 500 1000 4500

0 500 1000 10-3

0 500 1000 10-4

Figure 9: Biclustering problem Plain D4L and Modiﬁed ATC algorithms: objective value [subplots on the left], consensus disagreement [subplots in the center], and distance from stationarity ν [cf. (26)] [subplots on the right] vs. number of message exchanges. Comparison over three network settings [cf. Table 3]: N1 [subplots (a)], N2 [subplots (b)], and N3 [subplots (c)].

Decentralized Dictionary Learning Over Time-Varying Digraphs

In order to assess the quality of the solutions achieved by the three algorithms, we employ the following procedure. Given the limit point (up to the ﬁxed accuracy) D of the algorithm under consideration, patients information is in form of (unlabeled clusters of) data points {D m,:}56 m=1, where D m,: denotes the m-th row of D and represents an individual patient. In order to compare D with the labeled ground truth, we need to tag labels to the clustered points of D . To do so, we run the K-means clustering algorithm on {D m,:}56 m=1. Speciﬁcally, we ﬁrst run K-means 100 times and, in each run, we perform a preliminary clustering using 10% of the points (randomly chosen). Then, among the 100 obtained clustering conﬁgurations, we picked the one with the smallest within-cluster sum of point-to-centroid distances .1 Finally, we assign to each cluster the label associated with the most populated type of cancer in the cluster. Denoting the ground truth classes by {Ci}4 i=1 (recall that there are 4 classes/types of cancer), where each Ci consists of the group of patients with the same type of cancer, and by { e Ci}4 i=1 the clustering obtained by the procedure described above applied to the outcome D of the simulated algorithms, we measure the quality of the clustering by the Jaccard index, deﬁned as

S i Ci e Ci P i Ci e Ci .

Clearly 0 J 1, and the higher the index value, the better the quality of the clustering. In Table 4, we report the average and Maximum Absolute Deviation (MAD) of the Jaccard indices from their average, computed over the aforementioned 5 realizations of the three graph topologies, as in Table 3 (see also Fig. 9). The values in the table clearly show that Plain D4L achieves better results than those produced by Modiﬁed ATC or centralized methods. Moreover, the value of the Jaccard index from Plain D4L does not depend on the speciﬁc network topology. which is not the case for Modiﬁed ATC.

Network # Plain D4L Modiﬁed ATC Lee et al. (2010) N1 0.8983/0 0.7778/0 N2 0.8983/0 0.7045/0.3218 N3 0.8983/0 0.7892/0.0172 Centralized 0.7231/

Table 4: Biclustering problem Average/MAD of Jaccard indices over 5 realizations of digraphs.

5.3. Non-negative Sparse Coding (NNSC) and Sparse PCA (SPCA)

Problem Formulation: We consider the Non-negative Sparse Coding (NSC) formulation (7) (Hoyer, 2004) and the Sparse PCA problem (6) (Mairal et al., 2010). For both formulations, we run experiments using the following two datasets:

MIT-CBCL face database #1 (Sung, 1996): a pool of N = 2, 429 vectorized face images of size 19 19 pixels each (i.e. M = 361);

1. Given a clustering partition {Ci}4 i=1, the within-cluster sum of point-to-centroid distances measures the quality of the k-means clustering, and is deﬁned as P4 i=1 P

j Ci ||D j,: D Ci||2, where DCi 1 |Ci| P

j Ci D j,: and |Ci| denotes the cardinality of the set Ci.

Daneshmand, Sun, Scutari, Facchinei, Sadler

The VOC 2006 database (Everingham et al., 2010): a pool of N = 10, 000 vectorized natural image patches of size 16 16 pixels each (i.e. M = 256).

Consistently with (Mairal et al., 2010), the free parameters are set as:

NNSC (7): K = 49, λ = µ = 1/

M, and α = 1;

Sparse PCA (6): K = 49, λX = µX = 1/

M, λD = µD = 1/

M, and α = 1.

The total number of variables for the above optimization problems are 136,710 for the MIT-CBCL dataset, and 502,544 for the VOC 2006 dataset. We simulated the communication network as static directed graphs of size I, clustered in nc groups, where each node has an outgoing arc to another node in the same cluster with probability p1, while p2 is the probability of an outgoing arc to a node in a diﬀerent cluster. We run our tests over 6 diﬀerent network scenarios, with various size I and probability pair (p1, p2), as given in Table 5. Note that if N/I is not an integer, we pad zero columns to the data matrix S so that all the agents own equal-size partitions Si s, thus ni = N/I in both problems (6) and (7).

Network # I nc p1 p2 N4 10 2 0.9 0.3 N5 10 2 0.2 0.1 N6 50 5 0.9 0.3 N7 50 5 0.2 0.1 N8 250 10 0.9 0.3 N9 250 10 0.2 0.1

Table 5: Network setting for the NNSC and Sparse PCA problems.

5.3.1. Non-negative Sparse Coding

Algorithms and tuning: We test the Plain D4L, with fi and hi chosen according to (15) and (16), respectively. The other parameters of the algorithm are set to: γν = γν 1(1 ϵγν 1), with γ0 = 0.2 and ϵ = 10 2; and τ ν D,i = 10 and τ ν X,i = max(L Xi(Uν (i)), 10). We compare the proposed scheme with a modiﬁed version of ATC, equipped with our new consensus protocol, implementable on directed networks. All the distributed algorithm are initialized setting X0 i = 0 and D0 (i) equal to some randomly chosen columns of Si. Both Plain D4L and Modiﬁed ATC call for solving a LASSO problem in updating the private variables (cf. Sec. 3.1); the update of the dictionary has instead a closed form expression, see (22). For both Plain D4L and Modiﬁed ATC, the LASSO subproblems at iteration ν are solved using the projected (sub)gradient algorithm with diminishing step-size γr = γr 1(1 ϵγr 1), where γ0 = 0.9, ϵ = 10 3, and r denoting the inner iteration index. We terminate the projected subgradient algorithm in the inner loop when Jr X,i b Xν,r i Xν,r i , 10 4, where

b Xν,r i argmin Xi Xi

D Xifi(Uν (i), Xν,r i ) + τ ν X,i (Xν,r i Xν i ) , Xi Xν,r i E +1

2 Xi Xν,r i 2+gi (Xi) ,

Decentralized Dictionary Learning Over Time-Varying Digraphs

and Xν,r i denotes the value of Xi at the ν-th outer and r-th inner iteration. In all our simulations, the above accuracy was reached within 30 (inner) iterations of the projected subgradient algorithm.

Convergence speed and quality of the reconstruction: We run the Plain D4L and Modiﬁed ATC algorithms over diﬀerent network settings, as listed in Table 5, and we terminated them after 1500 message exchanges. We replicated the tests for 5 independent realizations and we reported the average of the ﬁnal values of the objective function, the consensus disagreement, and the distance from stationarity in Table 6 and Table 7, for the MIT-CBCL and VOC 2006 datasets, respectively. In Fig. 10 and Fig. 11 (for MIT-CBCL and VOC 2006 datasets, respectively), we plot the average value (over the aforementioned 5 graph realizations) of the objective function [subplot on the left], the consensus disagreement eν [subplot in the center], and the distance from stationarity ν [subplot on the right], versus number of message exchanges, for the two extreme network scenarios N4 [subplot (a)] and N9 [subplot (b)]. These results show that the proposed Plain D4L signiﬁcantly outperforms the Modiﬁed ATC algorithm. They also show a remarkable stability of Plain D4L with respect to the simulated network graphs, which is not observed for the Modiﬁed ATC, whose performance deteriorates signiﬁcantly going from N4 to N9.

Network # objective value consensus disagreement distance from stationarity N4 169.8/171.9 9.17e-7/2.9e-4 4.7e-4/8.8e-2 N5 169.9/172.2 4.5e-6/6.7e-4 5.3e-4/9.8e-2 N6 169.8/177.2 2.4e-7/1.9e-4 5.1e-4/6.3e-2 N7 169.9/177.3 1.1e-6/6.3e-4 5.5e-4/6.1e-2 N8 169.8/191.0 2.1e-7/1.2e-4 5.1e-4/2.2e-2 N9 169.8/190.9 5.8e-7/3.1e-4 6.6e-4/1.1e-2

Table 6: NNSC problem (MIT-CBCL dataset) Plain D4L/Modiﬁed ATC algorithms: objective value, consensus disagreement, and distance from stationarity obtained after 1500 message exchanges.

Network # objective value consensus disagreement distance from stationarity N4 845.2/848.8 2.9e-6/1.3e-4 1.1e-3/4.1e-3 N5 845.9/850.1 1.5e-5/5.3e-4 1.5e-3/6.1e-3 N6 844.8/879.5 5.7e-7/2.2e-4 1.3e-3/4.4e-2 N7 844.5/879.3 1.9e-6/1.2e-3 1.3e-3/3.9e-2 N8 844.8/941.2 5.6e-7/1.5e-4 1.6e-3/4.8e-2 N9 845.0/941.8 1.5e-6/3.6e-4 1.1e-3/5.1e-2

Table 7: NNSC problem (VOC 2006 dataset) Plain D4L/Modiﬁed ATC algorithms: objective value, consensus disagreement, and distance from stationarity obtained after 1500 message exchanges.

Daneshmand, Sun, Scutari, Facchinei, Sadler

0 500 1000 1500 160

0 500 1000 1500 10-8

0 500 1000 1500 10-4

0 500 1000 1500 160

0 500 1000 1500 10-8

0 500 1000 1500 10-4

Figure 10: NNSC problem (MIT-CBCL dataset) Plain D4L and Modiﬁed ATC algorithms: objective value [subplots on the left], consensus disagreement [subplots in the center], and distance from stationarity ν [cf. (26)] [subplots on the right] vs. number of message exchanges. Comparison over the network settings N4 [subplots (a)] and N9 [subplots (b)] (cf. Table 5).

5.3.2. Sparse Principal Component Analysis (SPCA)

Algorithms and tuning: We test the same D4L version as used in the Biclustering experiments (cf. Subsec. 5.2), i.e., fi and hi are chosen according to (14) and (16), respectively; we set γν = γν 1(1 ϵγν 1), with γ0 = 0.2 and ϵ = 10 2; and τ ν D,i = 10 and τ ν X,i = max(L Xi(Uν (i)), 10). We term it Plain D4L. We compare Plain D4L with a modiﬁed version of the ATC algorithm, which has been adapted to solve (6) and equipped with our new consensus protocol to handle directed network topologies. All the distributed algorithm are initialized, setting X0 i = 0 and D0 (i) equal to some randomly chosen columns of Si. The subproblems (8) and (11) at iteration ν are solved using the projected (sub)gradient algorithm; the setting is the same as that used in the Biclustering problem (cf. Subsec. 5.2). We terminate the projected subgradient algorithm in the inner loop when Jr D,i 10 4 (in

Decentralized Dictionary Learning Over Time-Varying Digraphs

0 500 1000 1500 800

0 500 1000 1500 10-6

0 500 1000 1500 10-3

0 500 1000 1500 800

0 500 1000 1500 10-6

0 500 1000 1500 10-3

Figure 11: NNSC problem (VOC 2006 dataset) Plain D4L and Modiﬁed ATC algorithms: objective value [subplots on the left], consensus disagreement [subplots in the center], and distance from stationarity ν [cf. (26)] [subplots on the right] vs. number of message exchanges. Comparison over the network settings N4 [subplots (a)] and N9 [subplots (b)] (cf. Table 5).

solving subproblem (8)) and Jr X,i 10 4 (in solving subproblem (11)), where Jr D,i and Jr X,i are deﬁned as those in Subsec. 5.2. In all our simulations, the above accuracy was reached within 30 (inner) iterations of the projected subgradient algorithm.

Convergence speed and quality of the reconstruction: We test the Plain D4L and the Modiﬁed ATC in diﬀerent network settings, as listed in Table 5. The setting of the experiments and the averaging procedure of the reported values is the same of those used for the NNSC problem. The results of our experiments are reported in Table 8 and Figure 12 for the MIT-CBCL dataset; and in Table 9 and Figure 13 for the VOC 2006 dataset. The behaviors or the algorithms are very similar to those described in the NNSC case and conﬁrm all previous observations.

Daneshmand, Sun, Scutari, Facchinei, Sadler

0 500 1000 1500 0

0 500 1000 1500 10-5

0 500 1000 1500 10-4

0 500 1000 1500 0

0 500 1000 1500 10-5

0 500 1000 1500 10-4

Figure 12: SPCA problem (MIT-CBCL dataset) Plain D4L and Modiﬁed ATC algorithms: objective value [subplots on the left], consensus disagreement [subplots in the center], and distance from stationarity ν [cf. (26)] [subplots on the right] vs. number of message exchanges. Comparison over the network settings N4 [subplots (a)] and N9 [subplots (b)] (cf. Table 5).

Network # objective value consensus disagreement distance from stationarity N4 181.9/453.1 3.5e-5/6.1e-4 2.2e-3/21.39 N5 185.1/446.4 2.6e-5/1.5e-3 1.3e-3/136.2 N6 182.7/517.3 5.2e-6/4.3e-4 1.3e-3/1.2e-1 N7 186.3/512.0 2.3e-5/9.2e-4 1.4e-3/1.18 N8 181.9/566.0 4.0e-5/1.7e-4 2.6e-3/1.5e-1 N9 182.6/566.9 1.7e-4/2.9e-4 4.2e-3/1.0e-1

Table 8: SPCA problem (MIT-CBCL dataset) Plain D4L/Modiﬁed ATC algorithms: objective value, consensus disagreement, and distance from stationarity obtained after 1500 message exchanges.

Decentralized Dictionary Learning Over Time-Varying Digraphs

0 500 1000 1500 500

0 500 1000 1500 10-6

0 500 1000 1500 10-4

0 500 1000 1500 500

0 500 1000 1500 10-6

0 500 1000 1500 10-4

Figure 13: SPCA problem (VOC 2006 dataset) Plain D4L and Modiﬁed ATC algorithms: objective value [subplots on the left], consensus disagreement [subplots in the center], and distance from stationarity ν [cf. (26)] [subplots on the right] vs. number of message exchanges. Comparison over the network settings N4 [subplots (a)] and N9 [subplots (b)] (cf. Table 5).

Network # objective value consensus disagreement distance from stationarity N4 849.3/2516.0 5.7e-6/1.3e-3 2.3e-3/6532 N5 866.0/2533.6 1.8e-5/3.5e-3 2.0e-3/1.3e+4 N6 864.8/2621.0 7.2e-6/4.3e-4 3.4e-3/1.7e+4 N7 859.1/2624.1 5.3e-5/1.1e-3 2.4e-3/1.6e+4 N8 872.7/2643.2 6.6e-5/2.3e-4 1.1e-2/2.9e+4 N9 869.1/2642.0 2.3e-4/4.6e-4 5.6e-3/3.4e+4

Table 9: SPCA problem (VOC 2006 dataset) Plain D4L/Modiﬁed ATC algorithms: objective value, consensus disagreement, and distance from stationarity obtained after 1500 message exchanges.

Daneshmand, Sun, Scutari, Facchinei, Sadler

6. Conclusions

This paper studied a fairly general class of distributed dictionary learning problems over time-varying multi-agent networks, with arbitrary topologies. We proposed the ﬁrst decentralized algorithmic framework the D4L Algorithm with provable convergence for this class of problems. Numerical experiments showed promising performance of our scheme with respect to state-of-the-art distributed methods.

Acknowledgments

The work of Daneshmand, Sun, and Scutari has been supported by the the National Science Foundation (NSF grants CIF 1564044 and 1632599) and the Army Research Oﬃce (Grant W911NF1810238). The work of Facchinei has been partially supported by the Italian Ministry of Education, Research and University, under the PLATINO (PLATform for INn Ovative services in future internet) PON project (Grant Agreement no. PON01 01007).

Appendix A. Appendix

In this section we prove the major results of the paper, Theorems 2 and 3. We begin rewriting the D4L Algorithm in an equivalent vector-matrix form (cf. Sec. A.1), which is more suitable for the analysis. Theorem 2 and Theorem 3 are then proved in Sec. A.2 and Sec. A.3, respectively. Some miscellanea results supporting the main proofs are collected in Appendix A.4. Table 10 below summarizes the symbols appearing in the proofs.

Symbol Deﬁnition Member of Reference

Dφν (1/I) PI i=1 φν i Dν (i) D RM K (41)

Uφν (1/I) PI i=1 φν i Uν (i) D RM K (41) e Dν [ e Dν (1), e Dν (2), . . . , e Dν (I)] RM KI (34)

Uν [Uν (1), Uν (2), . . . , Uν (I)] RM KI (34)

Θν [Θν (1), Θν (2), . . . , Θν (I)] RM KI (34)

Gν [ Df1(Dν (1), Xν 1) , . . . , Df I(Dν (I), Xν I ) ] RM KI (34)

Wν (wν ij)I i,j=1 = (aν ij φν j /φν+1 i )I i,j=1 = (Φν+1) 1AνΦν RI I (35)

Wν:l Wν Wν 1 Wl, ν > l RI I (44)

c Wν Wν IM RMI MI (36)

c Wν:l Wν:l IM, ν > l RMI MI (44)

φν [φν 1, φν 2, . . . , φν I ] RI (34)

Φν diag (φν) RI I (34)

bΦ ν Φν IM RMI MI (36) Jφν (1/I)1φν RI I (45) b Jφν Jφν IM RMI MI (45)

Table 10: Table of notation (appendix)

Decentralized Dictionary Learning Over Time-Varying Digraphs

A.1. D4L in vector-matrix form

We rewrite Algorithm 1 in a more convenient vector-matrix form. To do so, we introduce the following notation. Recalling the deﬁnitions of e Dν (i) [cf. (8)], Uν (i) [cf. (10)], φν i [cf. (12)], and Θν (i) [cf. (13)], deﬁne the corresponding aggregate quantities

e Dν [ e Dν (1), e Dν (2), . . . , e Dν (I)] ,

Uν [Uν (1), Uν (2), . . . , Uν (I)] ,

φν [φν 1, φν 2, . . . , φν I] , Φν diag (φν) ,

Θν [Θν (1), Θν (2), . . . , Θν (I)] ,

Gν Df1(Dν (1), Xν 1) , Df2(Dν (2), Xν 2) , . . . , Df I(Dν (I), Xν I) ,

where diag(x) is a diagonal matrix whose diagonal entries are the components of the vector x. Combining the weights aν ij and the variables φν i in the update of Dν (i) [cf. (12)] in the single coeﬃcient

wν ij aν ijφν j P k aν ikφν k ,

we deﬁne the weight matrix Wν, whose entries are [Wν]i,j = wν ij. Note that Wν has the same zero-pattern of Aν, and the following properties hold (the latter under Assumption F) Wν = Φν+1 1AνΦν, and Wν1 = 1. (35)

Finally, we deﬁne the following augmented matrices

c Wν Wν IM and bΦ ν Φν IM, (36)

where IM is the M-by-M identity matrix. Using the above notation, the main iterates of the D4L Algorithm, i.e., (10), (12), and (13), can be rewritten in compact form as

Uν = Dν + γν( e Dν Dν), (37)

φν+1 = Aνφν, (38)

Dν+1 = c WνUν, (39)

Θν+1 = c WνΘν + bΦ ν+1 1 Gν+1 Gν . (40)

Instrumental to the analysis of the consensus disagreement are the following weighted average quantities:

i=1 φν i Dν (i), Uφν 1

i=1 φν i Uν (i). (41)

Using (12), (41) and the column stochasticity of Aν (cf. Assumption F3), it is not diﬃcult to check that

Dφν+1 = Uφν, (42)

Daneshmand, Sun, Scutari, Facchinei, Sadler

which, together with (10), leads to the following dynamics for Dφν:

Dφν+1 = Dφν + γν

i=1 φν i e Dν (i) Dν (i) . (43)

A.2. Proof of Theorem 2

We prove Theorem 2 in the following order (consistent with the statements in the theorem):

Step 1 Asymptotic consensus & related properties: We prove that consensus is asymptotically achieved, that is, limν eν = 0 [statement (a)], along with some properties on related quantities, which will be used in the other steps see Sec. A.2.1;

Step 2 Boundedness of the iterates and Lipschitz continuity of fi: We prove that the sequence Dν, Xν

ν generated by the algorithm is bounded [statement (b-i)], and, as a consequence, fi, D fi, and Xi hi are Lipschitz continuous on a (suitably deﬁned) compact set containing Dν, Xν

ν [Remarks 8 and 9] see Sec. A.2.2;

Step 3 Decrease of {U(D ν, Xν)}ν: By leveraging the Lipschitz continuity of the gradients as in Step 2, we study the decrease properties of {U(D ν, Xν)}ν [statement (b-ii)] see Sec. A.2.3;

Step 4 Vanishing X-stationarity: We prove that limν X(D ν, Xν) = 0 [statement (b-iii)] see Sec. A.2.4;

Step 5 Vanishing liminf D-stationarity: We prove lim infν D(D ν, Xν) = 0 [statement (b-iv)] see Sec. A.2.5;

Step 6 Vanishing D-stationarity: Finally, we prove limν D(D ν, Xν) = 0 [statement (b )] see Sec. A.2.6.

A.2.1. Step 1 Asymptotic consensus and related properties

1) Preliminaries: To analyze the dynamics of the consensus disagreement, we ﬁrst introduce the following product matrices and their augmented counterparts: Given Wν [cf. (35)], and ν, l N+, let

Wν Wν 1 Wl, ν > l, Wν, ν = l, 0I, ν < l,

and c Wν:l Wν:l IM. (44)

Deﬁne also the weight-average matrices

I 1φν , b Jφν Jφν IM. (45)

For notational simplicity, when φν = 1, we use J instead of Jφν and b J J IM. Using (35) and (41), it is not diﬃcult to check that the following hold:

b JφνDν = 1 Dφν, (46)

b Jφν+1 c Wν:l = b J bΦ l = b Jφl. (47)

Decentralized Dictionary Learning Over Time-Varying Digraphs

The dynamics of the consensus disagreement eν boils down to studying the decay of Wν:l Jφl 2 (this will be clear in the proof of Proposition 5 below). The following lemma shows that Wν:l converges geometrically to Jφl, as ν .

Lemma 4 (Scutari and Sun (2018)-Lemma 4.13, Ch. 3.4.2.5) Let {Gν}ν be a sequence of digraphs satisfying Assumption B; let {Aν}ν be a sequence of matrices satisfying Assumption F; and let {Wν}ν be the sequence of matrices deﬁned in (35). Then, there holds Wν:l Jφl 2 c W (ρ)ν l+1, ν l, ν, l N+,

where c W > 0 is a (proper) constant, and ρ (0, 1) is deﬁned as

ρ = 1 κ (I 1)B 1 B(I 1) < 1, (48)

with κ = κIB+1/I; and κ is deﬁned in Assumption F. Furthermore, the sequence {φν i }ν satisﬁes

ϵφ inf ν N+

min 1 i I φν i

κIB, and ϵφ sup ν N+

max 1 i I φν i

I (I 1)κIB. (49)

If all the matrices Aν are doubly-stochastic, then ϵφ = ϵφ = 1.

2) Proof of limν eν = 0. Using (30), we can write

eν (a) K Dν 1 D ν F

(b) K Dν 1 Dφν F + K

I D ν Dφν F (c) 2K Dν 1 Dφν F ,

where (a) follows from the equivalence of norms; (b) is due to the triangle inequality; and in (c) we used PI i=1 ai

I||a||, with a = (ai)I i=1 RI. The following proposition concludes the proof of statement (a), proving that Dν 1 Dφν F is square summable, along with some additional properties on related quantities.

Proposition 5 In the above setting, there hold:

lim ν ||Dν 1 Dφν||F = 0; (51)

t=1 ||Dt 1 Dφt||2 F < ; (52)

t=1 ||Ut 1 Uφt||2 F < . (53)

Proof To prove (51), let us ﬁrst expand Dν 1 Dφν as follows: for any ν 1,

Dν (39) = c Wν 1Uν 1 = c Wν 1Dν 1 + c Wν 1 Uν 1 Dν 1

= c Wν 1:0D0 +

t=0 c Wν 1:t Ut Dt , (54)

Daneshmand, Sun, Scutari, Facchinei, Sadler

where the last equality follows from induction and the deﬁnition of c Wν:l [cf. (44)]. Similarly, we expand the subtrahend as

1 Dφν (46) = b JφνDν (54) = b Jφν

c Wν 1:0D0 +

t=0 c Wν 1:t Ut Dt !

t=0 bΦ t Ut Dt !

Subtracting (55) from (54) and using (37), yields Dν 1 Dφν F c Wν 1:0 b J 2

t=0 γt c Wν 1:t b Jφt 2

(a) c1(ρ)ν + c2

t=0 γt (ρ)ν t (b) ν 0,

for some ﬁnite constants c1, c2 > 0, where (a) is due to Lemma 4 and the boundedness of {|| e Dν Dν||}ν; and (b) follows from Lemma 13(a) in Appendix A.4. Let us now proceed to prove (52). Using (56), we have

c1(ρ)t + c2

l=0 γl (ρ)t l !2

(a) 2c2 1 1 (ρ)2 + 2 c2 2 lim ν

k=0 γlγk(ρ)t k(ρ)t l

(b) 2c2 1 1 (ρ)2 + c2 2 lim ν

l=0 (γl)2(ρ)t l t 1 X

k=0 (ρ)t k + c2 2 lim ν

k=0 (γk)2(ρ)t k t 1 X

2c2 1 1 (ρ)2 + 2c2 2 1 ρ lim ν

l=0 (γl)2(ρ)t l (c) < ,

where in (a) and (b) we used (a + b)2 2(a2 + b2) and ab (a2 + b2)/2, respectively, and (c) is due to Lemma 13(b) (cf. Appendix A.4). We prove now (53). Using (41) and (37), we get Ut 1 Uφt 2

γt e Dt 1 1

i=1 φt i e Dt (i) + 1 γt Dt 1 Dφt

2 γt 2 e Dt 1 1

i=1 φt i e Dt (i) 2

F + 2 Dt 1 Dφt 2

where in the last inequality we used Jensen s inequality and (1 γt) 1. Therefore,

F lim ν 2 c3

γt 2 + lim ν 2

where c3 is a positive ﬁnite constant, and (a) follows from Assumption E and (52).

Decentralized Dictionary Learning Over Time-Varying Digraphs

Remark 6 (On L Xi(Uφν) and L Xi(Uν (i))) Recall that Xifi(Uφν, ) is Lipschitz con-

tinuous on Xi, with constant L Xi(Uφν) (cf. Assumption A2). Since ||Uν (i) Uφν||F ν 0

[cf. (53), Proposition 5] and L Xi(D) is continuous, we have L Xi(Uφν) L Xi(Uν (i)) ν 0, i = 1, 2, . . . I. (58)

A.2.2. Step 2 Boundedness of the iterates

We show that the sequence Dν, Xν

ν generated by the D4L Algorithm is bounded [statement (b-i)]. We prove the result only for hi given by (17); the proof can be easily tailored to the other choice of hi. If the sets Xi are bounded [Assumption A5(i)], the result follows readily. Therefore, we consider next the setting under A5(ii). Throughout the proof, we will use the following properties of fi and hi.

Remark 7 The surrogate functions fi and hi as in Assumption C have the following properties: for all i = 1, 2, . . . , I,

(a) fi( ; D, Xi) is strongly convex on D, uniformly with respect to (D, Xi) D Xi, with constant τ ν D,i > 0; and D fi(D; D, Xi) = Dfi(D, Xi), for all (D, Xi) D Xi.

(b) hi( ; D, Xi) is strongly convex on Xi, uniformly with respect to (D, Xi) D Xi, with constant τ ν X,i > 0; and Xi hi(Xi; D, Xi) = Xifi(D, Xi), for all (D, Xi) D Xi.

By the optimality of Xν+1 i in (11), there exist Ξ0 i Xigi(X0 i ) and Ξν+1 i Xigi(Xν+1 i ) such that 0 D Xi hi(Xν+1 i ; Uν (i), Xν i ) + Ξν+1 i , X0 i Xν+1 i E

= Ξν+1 i Ξ0 i + τ ν X,i(Xν+1 i X0 i ), X0 i Xν+1 i

+ D Xifi(Uν (i), Xν i ) Xifi(Uν (i), X0 i ), X0 i Xν+1 i E

τ ν X,i(Xν i X0 i ), X0 i Xν+1 i

+ D Xifi(Uν (i), X0 i ) + Ξ0 i , X0 i Xν+1 i E .

Using Remark 7(b) and the µi-strongly convexity of gi s, we obtain

(τ ν X,i + µi) Xν+1 i X0 i 2 F D τ ν X,i Xν i Xifi(Uν (i), Xν i ), Xν+1 i X0 i E

D τ ν X,i X0 i Xifi(Uν (i), X0 i ), Xν+1 i X0 i E

Xifi(Uν (i), X0 i ) + Ξ0 i | {z } Zν i Xi U(Uν (i),X0 i )

, Xν+1 i X0 i

Deﬁne Υν i (Xi) τ ν X,i Xi Xifi(Uν (i), Xi) and rewrite (59) as

(τ ν X,i + µi) ||Xν+1 i X0 i ||F Υν i (Xν i ) Υν i (X0 i ) F + Zν i F . (60)

Daneshmand, Sun, Scutari, Facchinei, Sadler

Since D is compact and X0 i is given, we have ||Zν i ||F BZ, for all i, ν 1, and some ﬁnite BZ > 0. Let us bound next ||Υν i (Xν i ) Υν i (X0 i )||F . We write

Υν i (Xν i ) Υν i (X0 i ) 2 F = (τ ν X,i)2 Xν i X0 i 2 F + Xifi(Uν (i), Xν i ) Xifi(Uν (i), X0 i ) 2

2τ ν X,i D Xifi(Uν (i), Xν i ) Xifi(Uν (i), X0 i ), Xν i X0 i E

(a) (τ ν X,i)2 Xν i X0 i 2 F

1 2τ ν X,i L Xi(Uν (i))

! Xifi(Uν (i), Xν i ) Xifi(Uν (i), X0 i ) 2

(61) where in (a) we used the 1/L Xi(Uν (i))-co-coercivity of Xifi(Uν (i), ) [due to the convexity of fi(Uν (i), ) and the L Xi(Uν (i))-Lipschitianity of Xifi(Uν (i), ) (Rockafellar and Wets, 1998, Prop.12.60)], i.e., D Xifi(Uν (i), Xi) Xifi(Uν (i), Yi), Xi Yi E 1 L Xi(Uν (i)) Xifi(Uν (i), Xi) Xifi(Uν (i), Yi) 2

F , Xi, Yi Xi.

Note that ||Υν i (Xν i ) Υν i (X0 i )||F τ ν X,i||Xν i X0 i ||F as long as τ ν X,i 1

2L Xi(Uν (i)), which is satisﬁed under D1. Therefore, we can bound (60) as

(τ ν X,i + µi)||Xν+1 i X0 i ||F τ ν X,i Xν i X0 i F + BZ. (62)

We can now prove that, starting from X0 i , the iterates Xν i stays in the ball Bi(Ri, X0 i ) {Xi RK ni : Xi X0 i F Ri}, for all ν 1, where Ri BZ/µi. Let us prove it by induction. Evidently X0 i Bi(Ri, X0 i ). Let Xν i Bi(Ri, X0 i ); by (62), we get

||Xν+1 i X0 i ||F τ ν X,i τ ν X,i + µi ||Xν i X0 i ||F + BZ τ ν X,i + µi Ri,

where the second inequality is due to Xν i Bi(R, X0 i ) and R BZ/µ. Hence Xν+1 i Bi(Ri, X0 i ). Therefore, Xν i Bi(Ri, X0 i ), for all ν 0. Since D is bounded (cf. Assumption A3), it follows that (Dν (i), Xν i ) D Bi(Ri, X0 i ), for all ν 0.

Remark 8 (On the Lipschitz continuity of fi s) Since fi is C2, a direct consequence of the boundedness of Dν, Xν

ν is that fi [the gradient of fi with respect to (D, Xi)] is Lipschitz continuous on D Bi(Ri, X0 i ), that is, there exists some positive ﬁnite constant L ,i such that

|| fi(D, Xi) fi(D , X i)||F L ,i ||(D, Xi) (D , X i)||F , (63)

for all (D, Xi), (D , X i) D Bi(Ri, X0 i ), and i = 1, 2, . . . , I. We deﬁne L maxi L ,i. The above result also implies that DF : D (X1 XI) D [cf. (P)] is Lipschitz continuous on D (B1(R1, X0 1) BI(RI, X0 I)), with constant L D I L .

Decentralized Dictionary Learning Over Time-Varying Digraphs

Remark 9 (On the Lipschitz continuity of D fi and Xi hi) D fi : D RK ni RK ni RM K and Xi hi : RK ni D RK ni RK ni are Lipschitz continuous on D D Bi(Ri, X0 i ) and Bi(Ri, X0 i ) D Bi(Ri, X0 i ), respectively, with constants LD ,i and LX ,i. Let us denote LD maxi LD ,i and LX maxi LX ,i.

A.2.3. Step 3 Decrease of {U(D ν, Xν)}ν

We study here the properties of {U(D ν, Xν)}ν showing, in particular, that it is convergent [statement (b-ii)]. We begin with the following intermediate result.

Proposition 10 Consider the setting of Theorem 2(b); there exist positive constants s X, τD, c7, c8, and a suﬃciently large ν N+ such that the following holds: for all ν ν,

U(Dφν+1, Xν+1) U(Dφ ν, X ν)

l= ν W l + Eν, ν, (64)

||Xl+1 Xl||F Zl

I γl || e Dl Dl||F I ϵφ

2 τD T l 2 , (65)

t=l γt(ρ)t l + LX Ul 1 Uφl F , (66)

W l I ϵ2 φ 4 τD γl T l 2 + 1 4 s X

Zl 2 + LGγl Dl 1 Dφl F , (67)

1 ρ((ρ) ν (ρ)ν+1) + c7BX (1 ρ)2

max t ν γt , (68)

{T ν}ν is such that

ν=1 (T ν)2 < , (69)

with ρ (0, 1) and ϵφ deﬁned in (48) and (49), respectively.

Proof See Appendix A.4.3

Note that the sequences {Zl}l, {W l}l, and {Eν, ν}ν, ν above satisfy

l= ν (Zl)2 < ; (70)

l= ν W l < ; (71)

lim ν Eν, ν

| {z } E , ν<

Daneshmand, Sun, Scutari, Facchinei, Sadler

where (70) follows from (53) [cf. Prop. 5], Assumption E, and Lemma 13(b) [cf. (107)] in Appendix A.4.1; (71) is a consequence of limν Pν l=ν1 γl(T l)2 < [due to (69)], (52) [cf. Prop. 5], and (70); and eq. (72) is proved by inspection. It follows from (64), (70)-(72) and the lower-boundedness of U (due to Assumption A1) that {U(D ν, Xν)}ν is convergent. Indeed, taking the limsup of the LHS of (64) and using (71) and (72), we get

< lim sup ν U(Dφν+1, Xν+1) U(Dφ ν, X ν) +

l= ν W l + E , ν < .

Taking now the liminf of the RHS of the above inequality with respect to ν while using (72) and lim ν P l= ν W l = 0, yields

< lim sup ν U(Dφν+1, Xν+1) lim inf ν U(Dφ ν, X ν) < ,

which implies the convergence of {U(Dφν, Xν)}ν to a ﬁnite value, and

||Xl+1 Xl||F Zl

l= ν γl || e Dl Dl||F I ϵφ

2 τD T l 2 < . (74)

Finally, we deduce that {U(D ν, Xν)}ν converges to the same limit point of {U(Dφν, Xν)}ν,

I D ν Dφν F Dν 1 Dφν F (51) ν 0; ii) the continuity of U; and iii) the

boundedness of Dν, Xν

ν [cf. Sec. A.2.2]. This concludes the proof.

A.2.4. Step 4 Vanishing X-stationarity

Building on the results in the previous step, we prove here limν X(D ν, Xν) = 0 [statement (b-iii)]. For notational simplicity, we will use the shorthand b Xν i b Xi(D ν, Xν) and b Xν b X(D ν, Xν), with b Xi(D ν, Xν) and b X(D ν, Xν) deﬁned in (29). Using the equivalence of norms and the triangle inequality, we have

X(D ν, Xν) KX b Xν Xν F KX Xν+1 Xν F | {z } term I + KX Xν+1 b Xν F | {z } term II ,

for some KX > 0. The rest of the proof consists in showing that term I and term II above are asymptotically vanishing. On term I: Term I can be bounded as

1 2||Xν+1 Xν||2 F ||Xν+1 Xν||F Zν

for all ν N+, where the inequality follows from 1

2a2 (a b)2 + b2, with a, b R. This, together with (70) and (73), yields

ν=0 ||Xν+1 Xν||2 F < = lim ν ||Xν+1 Xν||F = 0. (76)

Decentralized Dictionary Learning Over Time-Varying Digraphs

On term II: Invoking the optimality of b Xν i [cf. (29)] and Xν+1 i [cf. (11)], yields

D Xifi(D ν, Xν i ) + ˆτX( b Xν i Xν i ), Xν+1 i b Xν i E + gi(Xν+1 i ) gi( b Xν i ) 0, D Xi hi(Xν+1 i ; Uν (i), Xν i ), b Xν i Xν+1 i E + gi( b Xν i ) gi(Xν+1 i ) 0.

Summing the two inequalities above and using Remark 7(b), lead to

ˆτX||Xν+1 i b Xν i ||2 F D ˆτX Xν+1 i Xν i + Xifi(D ν, Xν i ) Xifi(Uν (i), Xν i ), Xν+1 i b Xν i E

+ D Xi hi(Xν i ; Uν (i), Xν i ) Xi hi(Xν+1 i ; Uν (i), Xν i ), Xν+1 i b Xν i E . (77) Using the LX ,i-Lipschitz continuity of Xi hi [cf. Remark 9] and the L ,i-Lipschitz continuity of fi, and (10), it is not diﬃcult to show that (77) implies

||Xν+1 i b Xν i ||F

LX ,i + ˆτX

Xν+1 i Xν i F + L ,i

D ν Dν (i) F + L ,i

ˆτX γν e Dν (i) Dν (i) F ,

LX ,i + ˆτX

Xν+1 i Xν i F + L ,i

D ν Dν (i) F + L ,i

for some BD > 0, where in the last inequality we used the boundedness of {|| e Dν Dν||F }ν. Eq. (78) together with (76), Theorem 2(a), and γν ν 0 (cf. Assumption E), yield

lim ν Xν+1 i b Xν i F = 0. (79)

This concludes the proof of statement (b-iiii).

A.2.5. Step 5 Vanishing liminf D-stationarity

We prove lim infν D(D ν, Xν) = 0 [statement (b-iv)] and {(D ν, Xν)}ν has at least one limit point which is a stationary solution of P. For notational simplicity, we will use the shorthand b Dν b D(D ν, Xν), with b D(D ν, Xν) deﬁned in (28). We begin bounding D(D ν, Xν) as

D(D ν, Xν) KD b Dν D ν F KD e Dν (i) Dν (i) F | {z } term I

+KD e Dν (i) b Dν F | {z } term II

+KD Dν (i) D ν F ,

for some KD >0. Note that Dν (i) D ν F 0 [Theorem 2(a)]. We show next that lim infν e Dν (i) Dν (i) F = 0 and e Dν (i) b Dν F 0, which proves lim infν D(D ν, Xν) = 0. On term I: Similarly to the derivations of (75), there holds

2 || e Dν Dν||2 F γν || e Dν Dν||F I ϵφ

2 τD T ν 2 + I ϵφ

2 γν(T ν)2, (80)

Daneshmand, Sun, Scutari, Facchinei, Sadler

for all ν N+. By eq. (74) and P ν=0 (T ν)2 < [cf. (69)], we have

ν=0 γν|| e Dν Dν||2 F < Assumption E = lim inf ν || e Dν Dν||F = 0. (81)

On term II: Using the optimality of b Dν [cf. (28)] and e Dν (i) [cf. (8)], yields D DF(D ν, Xν) + ˆτD( b Dν D ν), e Dν (i) b DνE + G( e Dν (i)) G( b Dν) 0, D D fi( e Dν (i); Dν (i), Xν i ) + I Θν i Dfi(Dν (i), Xν i ), b Dν e Dν (i) E + G( b Dν) G( e Dν (i)) 0.

Summing the two inequalities above and using Remark 7(a), yields

ˆτD|| e Dν (i) b Dν||2 F

D ˆτD( e Dν (i) Dν (i)) + ˆτD(Dν (i) D ν) + DF(D ν, Xν) DF(Dφν, Xν), e Dν (i) b DνE

+ D DF(Dφν, Xν) I Θν i , e Dν (i) b DνE

D D fi( e Dν (i); Dν (i), Xν i ) D fi(Dν (i); Dν (i), Xν i ), e Dν (i) b DνE . (82) Using the LD ,i-Lipschitz continuity of D fi [cf. Remark 9] and the L D-Lipschitz continuity of DF [cf. Remark 8], it is not diﬃcult to check that (82) implies

e Dν (i) b Dν F LD ,i + ˆτD

e Dν (i) Dν (i) F + D ν Dν (i) F + L D

I DF(Dφν, Xν) F ,

(83) for all i = 1, . . . , I. Since lim infν e Dν (i) Dν (i) F = 0 [cf. (81)], D ν Dν (i) F 0

[Theorem 2(a)], and Dν 1 Dφν 0 [cf. (51)], to prove lim infν e Dν (i) b Dν F = 0, it is suﬃcient to show that the last term on the RHS of the above inequality is asymptotically vanishing, which is done in the lemma below.

Lemma 11 (Vanishing gradient-tracking error) In the setting above, there holds:

I DF(Dφν, Xν) 2

Proof See Sec. A.4.4.

By lim infν D(D ν, Xν) = 0, it follows that there exists an inﬁnte subset N N+ such that lim N ν D(D ν, Xν) = 0. Since {(Dν, Xν)}ν is bounded [cf. Sec. A.2.2], it has a convergent subsequence {(Dν, Xν)}ν N , with N N; let (D , X ) denote its limit point. Then, it must be lim N ν D(D ν, Xν) = 0. Combining this result with limν X(D ν, Xν) = 0 (cf. Sec. A.2.4), one can conclude lim N ν ν = 0; hence, (D , X ) is a stationary solution of Problem P.

Decentralized Dictionary Learning Over Time-Varying Digraphs

A.2.6. Step 6 Vanishing D-stationarity

Finally, we prove limν D(D ν, Xν) = 0 [statement (b )]. In view of the results already proved in Step 5, it is suﬃcient to show that lim supν || e Dν Dν||F = 0. 1) Preliminaries: We begin introducing the following preliminary results.

Proposition 12 In the setting of Theorem 2(a), the following hold for e Dν [cf. (8)] and Xν [cf. (11)]:

(a) There exists some constant LD > 0 and sequence { T ν}ν, with limν T ν = 0, such that, for any ν1, ν2 N+,

|| e Dν2 e Dν1||F LD ||Dφν2 Dφν1||F + ||Xν2 Xν1||F + T ν1 + T ν2; (85)

(b) There exist some constants 0 < p X < 1 and q X > 0, and a suﬃciently large νX N+ such that, for all ν νX,

||Xν+1 Xν||F p X||Xν Xν 1||F + q X||Uν Uν 1||F . (86)

Proof See Appendix A.4.5

2) Proof of lim supν || e Dν Dν||F = 0. For notational simplicity, let us deﬁne e Dν e Dν Dν. Suppose by contradiction that lim supν || e Dν||F > 0; since lim infν || e Dν||F = 0 [cf. (81)], there exists δ > 0 such that || e Dν||F > 2δ and || e Dν ||F < δ for inﬁnitely many ν, ν N+. Therefore, one can ﬁnd an inﬁnite subset of indices, denoted by K, having the following properties: for any ν K, there exists an index iν > ν such that

|| e Dν||F < δ, || e Diν||F > 2δ, (87)

δ || e Dj||F 2δ, ν < j < iν. (88)

Let ν2 be a suﬃciently large integer such that (64) holds and T ν < 2 τD δ

I ϵφ , for all ν ν2

[such ν2 exists, due to (69)]. Note that there exists a δ > 0 such that δ I ϵφ

2 τD T ν δ, for all ν ν2. Choose K ν ν2; using (64), with ν = iν and ν = ν + 1, yields

U(Dφiν +1, Xiν+1) U(Dφν+1, Xν+1) c8 δ 2 iν X

l=ν+1 W l + Eiν,ν+1, (89)

for some ﬁnite constant c8 > 0. Using the convergence of {U(Dφν, Xν)}ν, P l=1 W l < [cf. (71)], and lim K ν Eiν,ν+1 = 0 [cf. (72)], inequality (89) implies

l=ν+1 γl = 0. (90)

We show next that (90) leads to a contradiction.

Daneshmand, Sun, Scutari, Facchinei, Sadler

It follows from (87) and (88) that, for all K ν ν2,

δ <|| e Diν||F || e Dν||F

(a) || e Diν e Dν||F = || e Diν Diν e Dν + Dν||F

(b) e Diν e Dν + Dν 1 Dφν 1 Dφiν Diν F

I ||Dφiν Dφν||F + e Eiν,ν,

e Eiν,ν LD||Xiν Xν||F + ||Dν 1 Dφν||F + ||Diν 1 Dφiν ||F + T ν + T iν,

where in (a) we used the reverse triangle inequality (i.e. ||A||F ||B||F ||A B||F , A, B RMI K); and in (b) we add/subtracted some dummy terms and used the triangle inequality. We prove next that limν e Eiν,ν = 0. Clearly, if limν ||Xiν Xν||F = 0, then (51) [cf. Proposition 5] and T ν ν 0 [cf. Proposition 12(a)] imply limν e Eiν,ν = 0. It is then suﬃcient to show limν ||Xiν Xν||F = 0. First, we bound ||Xiν Xν||F properly. Summing (86) from ν > ν2 to iν 1, yields

t=ν ||Xt+1 Xt||F p X

t=ν ||Xt Xt 1||F + q X

t=ν ||Ut Ut 1||F , (92)

||Xiν Xν||F

t=ν ||Xt+1 Xt||F p X 1 p X ||Xν Xν 1||F + q X 1 p X

t=ν ||Ut Ut 1||F .

Since limν ||Xν Xν 1||F = 0 [cf. (76)], it follows from the above inequality that lim K ν ||Xiν Xν||F = 0, if lim K ν Piν 1 t=ν ||Ut Ut 1||F = 0, which is proved next. Rewrite ﬁrst ||Ut Ut 1||F as

||Ut Ut 1||F ||Ut 1 Uφt||F + ||Ut 1 1 Uφt 1||F +

I||Uφt Uφt 1||F

(42) (43) = ||Ut 1 Uφt||F + ||Ut 1 1 Uφt 1||F + γt

i=1 φt i e Dt (i) Dt (i) F .

Since || PI i=1 φt i ( e Dt (i) Dt (i))||F is bounded (due to φt i ϵφ [cf. (49)] and compactness of D)

and lim K ν Piν t=ν γt = 0 [cf. (90)], there holds lim K ν Piν t=ν γt|| PI i=1 e Dt (i) Dt (i)||F =

0. Therefore, to prove lim K ν Piν 1 t=ν ||Ut Ut 1||F = 0, it is suﬃcient to show that lim K ν Piν t=ν ||Ut 1 Uφt||F = 0 [which implies also lim K ν Piν t=ν ||Ut 1 1 Uφt 1||F = 0, due to limν ||Uν 1 Uφν||F = 0, see (53)]. By (57), the boundedness

Decentralized Dictionary Learning Over Time-Varying Digraphs

of { e Dν}ν, and (90), it is suﬃcient to show that lim K ν Piν t=ν ||Dt 1 Dφt||F = 0. We have

l=ν ||Dl 1 Dφl||F (56) lim K ν

c1(ρ)l + c2

t=0 γt (ρ)l t !

= c2 lim K ν

t=0 γt (ρ)l t = c2 lim K ν

l=max(t+1,ν) γt (ρ)l t

= c2 lim K ν

l=ν γt (ρ)l t + c2 lim K ν

l=t+1 γt (ρ)l t

l=ν γt (ρ)l t + c2 1 ρ lim K ν

| {z } =0 by (90)

c2 1 ρ lim K ν

t=0 γt (ρ)ν t (104) = 0.

This proves limν ||Xiν Xν||F = 0 and thus limν e Eiν,ν = 0. We can now prove that (90) leads to a contradiction. Since e Eiν,ν ν 0, there exists a suﬃciently large integer ν3 K, such that ν3 > ν2 and e Eiν,ν < δ, for all ν > ν3. Deﬁne δ

such that 0 < δ δ e Eiν,ν. Using (43) and 1 φν = I, (91) implies

t=ν γt || e Dt||F (87) (88) 2δ

t=ν γt, (93)

for all K ν > ν3. Equation (93) contradicts (90). Hence, it must be lim supν || e Dν||F = 0, and thus limν || e Dν Dν||F = 0.

A.3. Proof of Theorem 3

(a) Rate of consensus error. Fix θ (0, 1). Combining (50) and (56), we obtain

l=1 γν l (ρ)l = c8

l=1 γν l (ρ)l + c8

l= (1 θ)ν +1 γν l (ρ)l

c8 γν (1 θ)ν (1 θ)ν X

l=0 (ρ)l + c8 γ0 ν X

l= (1 θ)ν +1 (ρ)l,

(a) =c8 γ θν 1 (ρ) (1 θ)ν +1

1 ρ + c8 γ0 (ρ) (1 θ)ν +1 1 (ρ) θν

c9 γ θν + (ρ)(1 θ)ν (b) c9

γ θν + (ρ) 1 θ

θ θν 1 (c) = O γ θν ,

for some positive constants c8 and c9, where in (a) we used ν (1 θ)ν = θν ; (b) follows from x x 1, x R; and in (c) we used ( ρ)ν = o(γν), for any ρ (0, 1). This proves statement (a).

Daneshmand, Sun, Scutari, Facchinei, Sadler

(b) Rate of optimization errors. In the following we will use the shorthand: b Xν i b Xi(D ν, Xν) and b Xν b X(D ν, Xν), with b Xi(D ν, Xν) and b X(D ν, Xν) deﬁned in (29); and b Dν b D(D ν, Xν), with b D(D ν, Xν) deﬁned in (28). 1) Proof of (32): We begin bounding X(D ν, Xν) as

1 2( X(D ν, Xν))2 (a) K1

2 || b Xν Xν||2 F K1||Xν+1 b Xν||2 F + K1||Xν+1 Xν||2 F , (95)

where (a) holds by equivalence of the norms with K1 being a proper positive constant. By squaring both sides of (78) and using 1

n(Pn i=1 ai)2 Pn i=1 a2 i , ai R (by Jensen inequality), the ﬁrst term on the RHS of (95) can be bounded as

1 3K2 ||Xν+1 i b Xν i ||2 F Xν+1 i Xν i 2 F + D ν Dν (i) 2

F + (γν)2, (96)

for some positive constant K2 > 0. Summing (96) over i = 1, . . . , I, yields

1 3K2 ||Xν+1 b Xν||2 F Xν+1 Xν 2 F + Dν 1 D ν 2 F + I(γν)2

Xν+1 Xν 2 F + 4 Dν 1 Dφν 2 F + I(γν)2. (97)

Finally, combining (95) and (97), yields

1 2K1 ( X(D ν, Xν))2 (3K2 + 1) Xν+1 Xν 2 F + 12K1 Dν 1 Dφν 2 F + 3K2I(γν)2.

(98) It follows from (98) together with (76), (52) and Assumption E

ν=0 ( X(D ν, Xν))2 < .

By the deﬁnition of TX,ϵ, there holds

ν=0 ( X(D ν, Xν))2 < ,

which proves (32). 2) Proof of (33): Following the same approach as above, we can bound D(D ν, Xν) as

1 3( D(D ν, Xν))2 K3

3 || b Dν D ν||2 F K3|| b Dν e Dν (i)||2 F +K3|| e Dν (i) Dν (i)||2 F +K3||Dν (i) D ν||2 F ,

(99) for some K3 > 0. Using (83), the ﬁrst term on the RHS of (99) can be bounded as

1 4K4 e Dν (i) b Dν 2 F e Dν (i) Dν (i) 2 F + Θν (i) 1

I DF(Dφν, Xν)

F + D ν Dν (i) 2 F

+ Dν 1 Dφν 2

Decentralized Dictionary Learning Over Time-Varying Digraphs

with some constant K4 > 0. Summing (100) over i = 1, . . . , I, we get

1 4K4 e Dν 1 b Dν 2 F

e Dν Dν 2 F + Θν 1 1

I DF(Dφν, Xν)

F + Dν 1 D ν 2 F + I Dν 1 Dφν 2 F

e Dν Dν 2 F + Θν 1 1

I DF(Dφν, Xν)

F + (4 + I) Dν 1 Dφν 2 F .

(101) Summing (99) over i = 1, . . . , I and using (101), yields

I 3K3 ( D(D ν, Xν))2

(4K4 + 1) e Dν Dν 2 F + 4K4

I DF(Dφν, Xν)

F + 4((4 + I)K4 + 1) Dν 1 Dφν 2 F .

It follows from (102) together with (81), (84), and (52)

ν=0 γν( D(D ν, Xν))2 < .

By deﬁnition of TD,ϵ and non-increasing property of {γν}ν, we get

γTD,ϵ TD,ϵ ϵ2

ν=0 γν( D(D ν, Xν))2 < . (103)

Using γν = K/νp, with some constant K > 0 and p (1/2, 1), (103) provides the desired result as in (33).

A.4. Miscellaneous results

This section contains some miscellaneous results used in the proofs of Theorems 2 and 3.

A.4.1. Sequence properties

The following lemma summarizes some summability properties of suitably chosen sequences, which appear in some of the proofs.

Lemma 13 Given the sequences {aν}ν and {bν}ν, and a scalar λ [0, 1), the following hold:

(a) If limν aν = 0, then,

t=1 at(λ)ν t = 0. (104)

Daneshmand, Sun, Scutari, Facchinei, Sadler

(b) If limν Pν t=1 at 2 < and limν Pν t=1 bt 2 < , then

t=1 atbl(λ)l t < , (105)

at 2 (λ)l t < , (106)

t=l at(λ)t l !2

Proof For the proof of (a) and (105)-(106) in (b), see (Nedi c et al., 2010, Lemma 7). We prove next (107). Expand the LHS of (107) as:

t=l at(λ)t l !2

k=l atak(λ)t l(λ)k l

(at)2 + (ak)2

2 (λ)t l(λ)k l = lim ν,ν

t=l (at)2(λ)t l ν X

k=l (λ)k l,

where the inequality is due to a b (a2 + b2)/2. Using the bound on the sum of the geometric series, the above inequality yields

t=l at(λ)t l !2

1 1 λ lim ν,ν

t=l (at)2(λ)t l = 1 1 λ lim ν

l=1 (at)2(λ)t l

1 1 λ lim ν

t=1 (at)2 t X

l=1 (λ)t l 1 (1 λ)2 lim ν

t=1 (at)2 < .

A.4.2. On the properties of the best-response map

Some key properties of the best-response maps deﬁned in (8) and (11) are summarized and proved next.

Proposition 14 Let Dν, Xν

ν be the sequence generated by the D4L Algorithm, in the setting of Theorem 2(a). Given the solution maps deﬁned in (8) and (11), the following hold: (a) There exist some constants s D > 0 and η > 0, and a sequence {T ν}ν, with P ν=1 (T ν)2 < , such that: for all ν 1,

D DF(Dφν, Xν),

i=1 φν i e Dν (i) Dν (i) E +

i=1 φν i G( e Dν (i)) G(Dν (i))

s D || e Dν Dν||F I ϵφ

2 s D T ν 2 + η || e Dν Dν||F

t=1 (ρ)ν t||Xt Xt 1||F + I2 ϵ2 φ 4 s D (T ν)2 ,

Decentralized Dictionary Learning Over Time-Varying Digraphs

where ρ (0, 1) and ϵφ are deﬁned in (48) and (49), respectively; (b) There exist ﬁnite constants s X > 0 and LX > 0, such that: for all ν 1,

D Xifi(Dφν+1, Xν i ), Xν+1 i Xν i E +

gi(Xν+1 i ) gi(Xν i )

i=1 τ ν X,i||Xν+1 i Xν i ||2 F + LX||Uν 1 Uφν||F ||Xν+1 Xν||F .

Proof (a) It follows from the optimality of e Dν (i) [cf. (8)] and convexity of G that

D D fi( e Dν (i); Dν (i), Xν i ) + IΘν (i) Dfi(D(i), Xi), Dν (i) e Dν (i) E + G(Dν (i)) G( e Dν (i)) 0. (110) Adding and subtracting inside the ﬁrst term P j Dfj(Dφν, Xν j ) and using D fi(Dν (i); Dν (i), Xν i ) = Dfi(Dν (i), Xν i ) [cf. Remark 7], inequality (110) becomes

D D fi( e Dν (i); Dν (i), Xν i ) D fi(Dν (i); Dν (i), Xν i ), e Dν (i) Dν (i) E

+ D I Θν (i)

j=1 Dfj(Dφν, Xν j ), e Dν (i) Dν (i) E

j=1 Dfj(Dφν, Xν j ), e Dν (i) Dν (i) E + G( e Dν (i)) G(Dν (i)) 0.

Invoking the uniform strongly convexity of fi( ; Dν (i), Xν i ), the deﬁnition of Θν (i) in (13), and

recalling that DF(Dφν, Xν) = P j Dfj(Dφν, Xν j ), we get

D DF(Dφν, Xν), e Dν (i) Dν (i) E + G( e Dν (i)) G(Dν (i))

τ ν D,i e Dν (i) Dν (i) 2 + I

j=1 Dfj(Dφν, Xν j )

e Dν (i) Dν (i) F .

Multiplying both side of the above inequality by the positive quantities φν i and summing over i = 1, 2, . . . , I while using φν i ϵφ [cf. (49)], yields

D DF(Dφν, Xν),

i=1 φν i e Dν (i) Dν (i) E +

i=1 φν i G( e Dν (i)) G(Dν (i))

s D|| e Dν Dν||2 + I ϵφ

i=1 Dfi(Dφν, Xν i )

F | {z } gradient tracking error

e Dν Dν F ,

Daneshmand, Sun, Scutari, Facchinei, Sadler

where s D is any positive constant such that s D mini,ν φν i τ ν D,i [note that such a constant exists because φν i ϵφ, with ϵφ > 0 deﬁned in (49), and all τ ν D,i are uniformly bounded away from zero see Assumption D1]. Now let us bound the gradient tracking error term in (111). Using (40) recursively, Θν

can be rewritten as

Θν = c Wν 1:0 Θ0 +

t=1 c Wν 1:t bΦ t 1 Gt Gt 1 + bΦ ν 1 Gν Gν 1 . (112)

Using the deﬁnition of Gν [cf. (34)] and b J [cf. (45)], write

i=1 Dfi(Dν (i), Xν i ) = b JGν = b JG0 +

t=1 b J Gt Gt 1 ,

which, using Θ0 = G0, leads to the following expansion for 1 1

I PI i=1 Dfi(Dφν, Xν i ):

i=1 Dfi(Dφν, Xν i ) = b J eΘ 0 +

t=1 b J Gt Gt 1

Dfi(Dφν, Xν i ) Dfi(Dν (i), Xν i ) .

Using (112) and (113), the gradient tracking error term in (111) can be upper bounded as

i=1 Dfi(Dφν, Xν i ) F

(a) c Wν 1:0 b J 2

c Wν 1:t b Jφt 2

+ bΦ ν 1 b J 2

Gν Gν 1 F + 1

Dfi(Dφν, Xν i ) Dfi(Dν (i), Xν i ) F

(b) c4 (ρ)ν + c5 L

t=1 (ρ)ν t Dt Dt 1 F + Xt Xt 1 F + L Dν 1 Dφν F

(c) = T ν + c5 L

t=1 (ρ)ν t Xt Xt 1 F ,

(114) for some positive ﬁnite constants c4 and c5, where in (a) we used the lower bound φν i ϵφ [cf. (49)] and b Jφt = b J bΦ t [cf. (47)]; and in (b) we used (34), (63) (cf. Remark 8), and Lemma 4; and in (c) we deﬁned T ν as

T ν c4 (ρ)ν + c5 L

t=1 (ρ)ν t Dt Dt 1 F + L Dν 1 Dφν F . (115)

Decentralized Dictionary Learning Over Time-Varying Digraphs

Substituting (114) into (111) yields D DF(Dφν,Xν),

i=1 φν i e Dν (i) Dν (i) E +

i=1 φν i G( e Dν (i)) G(Dν (i))

s D|| e Dν Dν||2 F + I ϵφT ν|| e Dν Dν||F

+ I ϵφ c5 L || e Dν Dν||F

t=1 (ρ)ν t Xt Xt 1 F

|| e Dν Dν||F I ϵφ

2 s D T ν 2 + I2 ϵ2 φ 4 s D (T ν)2

+ η || e Dν Dν||F

t=1 (ρ)ν t Xt Xt 1 F ,

with η = I ϵφc5L . To complete the proof, we need to show that P ν=1 (T ν)2 < . Note that the ﬁrst term on the RHS of (115) is square summable, and so is the third one, due to Proposition 5 [cf. (52)]. Invoking Jensen s inequality, it is suﬃcient to show that the second term on RHS of (115) is square summable. Following the same approach used to prove (52), we have

l=1 (ρ)t l Dl Dl 1 F

k=1 (ρ)t l(ρ)t k Dl Dl 1 F

(a) 1 1 ρ lim ν

l=1 (ρ)t l(γl)2 e Dl 1 Dl 1 2

where (a) follows from ab (a2 + b2)/2, and (b) is due to Lemma 13 and Assumption A3. Hence P ν=1 (T ν)2 < . (b) We prove this statement using the deﬁnition (15) of fi; the same conclusion holds also using the alternative choice (14) of fi; the proof is thus omitted. Invoking the optimality of Xν+1 [cf. (11)] together with the convexity of gi, yield D Xi hi(Xν+1 i ; Uν (i), Xν i ) Xi hi(Xν i ; Uν (i), Xν i ), Xν i Xν+1 i E

+ D Xi hi(Xν i ; Uν (i), Xν i ) Xi hi(Xν i ; Uφν, Xν i ), Xν i Xν+1 i E

+ D Xi hi(Xν i ; Uφν, Xν i ), Xν i Xν+1 i E + gi (Xν i ) gi(Xν+1 i ) 0.

Using Remark 7 and Uφν = Dφν+1 [cf. (42)], we obtain D Xifi(Dφν+1, Xν i ), Xν+1 i Xν i E + gi(Xν+1 i ) gi (Xν i )

τ ν X,i Xν+1 i Xν i 2 F + D Xifi(Uφν, Xν i ) Xifi(Uν (i), Xν i ), Xν+1 i Xν i E ,

which, together with (63), leads to the desired result (109), with LX =

Daneshmand, Sun, Scutari, Facchinei, Sadler

A.4.3. Proof of Proposition 10

We begin observing that

G(Dφν+1) G(Dφν) + γν

i=1 φν i G( e Dν (i)) G(Dν (i)) + γν

i=1 φν i G(Dν (i)) G(Dφν) ,

(116) due to (43) and the convexity of G [together with 1 φν = 1]. Invoking the descent lemma for fi(Uφν, ) and using Uφν = Dφν+1 [cf. (42)], we get: for suﬃciently large ν, say ν ν0, U(Dφν+1, Xν+1)

n fi(Dφν+1, Xν i ) + gi(Xν i ) o + G(Dφν+1) +

gi(Xν+1 i ) gi(Xν i )

D Xifi(Dφν+1, Xν i ), Xν+1 i Xν i E + 1

i=1 L Xi(Uφν) Xν+1 i Xν i 2 F

(a) U(Dφν+1, Xν)

2L Xi(Uφν) Xν+1 i Xν i 2 F

+ LX||Uν 1 Uφν||F ||Xν+1 Xν||F

(b) U(Dφν+1, Xν) s X||Xν+1 Xν||2 F + LX||Uν 1 Uφν||F ||Xν+1 Xν||F ,

where in (a) we used Proposition 14(b); and in (b) s X > 0 is a constant such that infν ν0(τ ν X,i 1

2L Xi(Uφν)) s X, for all i = 1, . . . , I. Note that such a constant exists because of (58) and Assumption D1. To upper bound U(Dφν+1, Xν), we apply the descent lemma to F( , Xν). Recalling that DF( , Xν) is Lipschitz continuous with constant L D and using (43), we get

U(Dφν+1, Xν) F(Dφν, Xν) + γν

D DF(Dφν, Xν),

i=1 φν i e Dν (i) Dν (i) E

i=1 φν i e Dν (i) Dν (i)

i=1 gi(Xν i ) + G(Dφν+1),

(a) U(Dφν, Xν) s D γν

|| e Dν Dν||F I ϵφ

I || e Dν Dν||F

t=1 (ρ)ν t||Xt Xt 1||F + I ϵ2 φγν

4s D (T ν)2

2 (γν)2 || e Dν Dν||2 F + γν

i=1 φν i G(Dν (i)) G(Dφν)

| {z } LGγν Dν 1 Dφν F

(118) where in (a) we used (116) and Proposition 14(a), and the Lipschitz continuity of G, due to the convexity of G (G is thus locally Lipschitz continuous) and the compactness of D; we denoted by LG > 0 the Lipschitz constant.

Decentralized Dictionary Learning Over Time-Varying Digraphs

Combining (117) with (118) and deﬁning τ ν D s D γν IL D

2 , we get: for ν ν0,

U(Dφν+1, Xν+1) U(Dφν, Xν) s X||Xν+1 Xν||2 F + LX||Uν 1 Uφν||F ||Xν+1 Xν||F

|| e Dν Dν||F I ϵφ

2 τ ν D T ν 2 + I ϵ2 φ γν

4 τ ν D (T ν)2

I || e Dν Dν||F

t=1 (ρ)ν t||Xt Xt 1||F

| {z } c5 (ρ)ν+ρ 1 Pν t=1(ρ)ν t||Xt+1 Xt||F

+LG γν Dν 1 Dφν F ,

(119) where c5 (ρ) 1 ||X1 X0||F . Since γν 0, there exists an integer ν1 ν0 and some τD such that τ ν D τD > 0, for all ν ν1. Let ν be any integer ν ν1. Then, applying (119) recursively on ν, ν 1, . . . , ν + 1, ν, and using the boundedness of {|| e Dν Dν||F }ν, we obtain

U(Dφν+1, Xν+1)

U(Dφ ν, X ν) s X

l= ν ||Xl+1 Xl||2 F + LX

l= ν ||Ul 1 Uφl||F ||Xl+1 Xl||F

l= ν γl || e Dl Dl||F I ϵφ

2 τD T l 2 + I ϵ2 φ 4 τD

l= ν γl T l 2

l= ν γl(ρ)l + c7

t=1 γl(ρ)l t Xt+1 Xt F + LG

l= ν γl Dl 1 Dφl F ,

(120) for some ﬁnite constants c6, c7 > 0. Using the boundedness of {||Xν+1 Xν||F }ν (cf. Step 2), i.e., ||Xν+1 Xν||F BX, for all ν and some BX > 0, we can bound the double-sum term on the RHS of (120) as

t=1 γl(ρ)l t Xt+1 Xt F =

t=max( ν,l) γt(ρ)t l Xl+1 Xl F

t= ν γt(ρ)t l +

t=l γt(ρ)t l

(a) BX max ν t ν γt (1 (ρ) ν) 1 (ρ)ν ν+1

t=l γt(ρ)t l

max t ν γt +

t=l γt(ρ)t l,

where in (a) we used the summability of Pν t=l γt(ρ)t l, and the following bound

t= ν γt(ρ)t l max t ν γt (1 (ρ) ν) 1 (ρ)ν ν+1

Daneshmand, Sun, Scutari, Facchinei, Sadler

Substituting (121) in (120), yields

U(Dφν+1, Xν+1) U(Dφ ν, X ν) s X

t=l γt(ρ)t l + LX||Ul 1 Uφl||F

l= ν γl || e Dl Dl||F I ϵφ

2 τD T l 2 + I ϵ2 φ 4 τD

l= ν γl T l 2

1 ρ (ρ) ν (ρ)ν+1 + c7BX (1 ρ)2

| {z } Eν, ν

l= ν γl Dl 1 Dφl F

which complete the proof.

A.4.4. Proof of Lemma 11

The result follows by squaring both sides of eq. (114) [in the proof of Proposition 14 (a)], using P ν=1(T ν)2 < [cf. Proposition 14 (a)], and

l=1 (ρ)t l Xl Xl 1 F

k=1 (ρ)t l(ρ)t k Xl Xl 1 F

(a) 1 1 ρ lim ν

l=1 (ρ)t l Xl Xl 1 2

where (a) follows from ab (a2 + b2)/2, a, b R, and (b) is due to (76) and Lemma 13.

A.4.5. Proof of Proposition 12

(a) Using the optimality of e Dν (i) deﬁned in (8) together with convexity of G, yields D D fi( e Dν1 (i); Dν1 (i), Xν1 i ) + I Θν1 (i) Dfi(Dν1 (i), Xν1 i ), e Dν2 (i) e Dν1 (i) E + G( e Dν2 (i)) G( e Dν1 (i)) 0, D D fi( e Dν2 (i); Dν2 (i), Xν2 i ) + I Θν2 (i) Dfi(Dν2 (i), Xν2 i ), e Dν1 (i) e Dν2 (i) E + G( e Dν1 (i)) G( e Dν2 (i)) 0.

Summing the two inequalities above while adding/subtracting inside the inner product D fi( e Dν2 (i); Dν1 (i), Xν1 i ) and using (13), yield D D fi( e Dν2 (i); Dν1 (i), Xν1 i ) D fi( e Dν2 (i); Dν2 (i), Xν2 i ), e Dν2 (i) e Dν1 (i) E

D Dfi(Dν1 (i), Xν1 i ) Dfi(Dν2 (i), Xν2 i ), e Dν2 (i) e Dν1 (i) E + I D Θν1 (i) Θν2 (i), e Dν2 (i) e Dν1 (i) E

D D fi( e Dν1 (i); Dν1 (i), Xν1 i ) D fi( e Dν2 (i); Dν1 (i), Xν1 i ), e Dν1 (i) e Dν2 (i) E τ ν1 D,i || e Dν2 (i) e Dν1 (i)||2 F , (122)

Decentralized Dictionary Learning Over Time-Varying Digraphs

where the second inequality follows from the τ ν1 D,i-strong convexity of fi( ; Dν1 (i), Xν1 i ) [cf. Remark 7]. To bound the ﬁrst term on the LHS of the above inequality, let us use the expression (15) of fi, and write

D fi( e Dν2 (i); Dν1 (i), Xν1 i ) D fi( e Dν2 (i); Dν2 (i), Xν2 i )

= Dfi(Dν1 (i), Xν1 i ) Dfi(Dν2 (i), Xν2 i ) + τ ν1 D,i Dν2 (i) Dν1 (i) + (τ ν1 D,i τ ν2 D,i) e Dν2 (i) Dν2 (i) . (123) Substituting (123) in (122), we get

e Dν2 (i) e Dν1 (i)

|τ ν1 D,i τ ν2 D,i|

e Dν2 (i) Dν2 (i)

F + Dν1 (i) Dν2 (i)

F + I τ ν1 D,i

Θν1 (i) Θν2 (i)

(a) |τ ν1 D,i τ ν2 D,i|

e Dν2 (i) Dν2 (i)

F + Dφν2 Dφν1 F + Dν1 (i) Dφν1 F + Dν2 (i) Dφν2 F

+ I τ ν1 D,i

i=1 Dfi(Dφν1, Xν1 i )

+ I τ ν1 D,i

i=1 Dfi(Dφν2, Xν2 i )

(Dφν1, Xν1) (Dφν2, Xν2) ,

(124) where (a) holds by add/subtracting average quantities 1

I PI i=1 Dfi(Dφν, Xν i ) and D ν (i), triangle inequality, and invoking the Lipschitz continuity bound (63). By the compactness of D, we have || e Dν2 (i) Dν2 (i)||F BD, for some ﬁnite BD > 0. Furthermore, by Assumption D, τ ν D,i is convergent to some τ D,i > 0 and there exists a suﬃciently small s D > 0 such that s D τ ν D,i s 1 D , for all ν and i. Thus (124) gives

e Dν2 (i) e Dν1 (i)

s D + 1 (Dφν1, Xν1) (Dφν2, Xν2) + T ν1 i + T ν2 i (125)

with T ν i I s D || eΘ ν i 1

I PI i=1 Dfi(Dφν, Xν i )||F + ||Dν (i) Dφν||F + BD

s D |τ ν D,i τ D,i|. Con-

vergence of τ ν D,i (Assumption D2), Lemma 11 and Proposition 5 yield T ν i 0 as ν .

Summing (125) leads to the desired result, with LD = I( L D

s D + 1) and T ν PI i=1 T ν i . It is clear that the claim also holds when (14) is chosen for fi [speciﬁcally, trivial extension is to modify the RHS of (123) and following a similar steps as in the rest of the proof]; we omit further details. (b) We prove (86) when hi is given by (17); we leave the proof under (16) to the reader, since it is almost identical to that under (17).

Daneshmand, Sun, Scutari, Facchinei, Sadler

Invoking optimality of each Xν i and Xν+1 i deﬁned in (11) while using the strong convexity of hi( ; Uν (i), Xν i ) and gi s, it is not diﬃcult to show that the following holds:

(τ ν X,i + µi) Xν+1 i Xν i 2 F D Xi hi(Xν i ; Uν (i), Xν i ) Xi hi(Xν i ; Uν 1 (i) , Xν 1 i ), Xν i Xν+1 i E .

Using (17), the deﬁnition Υν i (Xi) τ ν X,i Xi Xifi(Uν (i), Xi), and the Cauchy-Schwarz inequality, the above inequality yields

(τ ν X,i + µi) Xν+1 i Xν i 2 F Υν i (Xν i ) Υν i (Xν 1 i ) F Xν+1 i Xν i F

+ Xifi(Uν (i), Xν 1 i ) Xifi(Uν 1 (i) , Xν 1 i ) F

Xν+1 i Xν i F

+ |τ ν X,i τ ν 1 X,i | Xν i Xν 1 i F Xν+1 i Xν i F . (126) Following the same steps used to prove (61), it is not diﬃcult to check that, under Assumption D1, ||Υν i (Xν i ) Υν i (X0 i )||F τ ν X,i||Xν i X0 i ||F . Using in (126) this bound together with the Lipschitz continuity of Xifi( , Xν 1 i ) [cf. Remark 8] and summing over i, yield

||Xν+1 Xν||2 F =

i=1 ||Xν+1 i Xν i ||2 F

τ ν X,i + |τ ν X,i τ ν 1 X,i |

τ ν X,i + µi ||Xν i Xν 1 i ||F ||Xν+1 i Xν i ||F

L τ ν X,i + µi ||Uν (i) Uν 1 (i) ||F ||Xν+1 i Xν i ||F .

pν X max i τ ν X,i + |τ ν X,i τ ν 1 X,i |

τ ν X,i + µi , and qν X max i L τ ν X,i + µi . (128)

Note that 0 < pν X, qν X < . Then, (127) becomes

||Xν+1 Xν||2 F pν X

i=1 ||Xν i Xν 1 i ||F ||Xν+1 i Xν i ||F + qν X

i=1 ||Uν (i) Uν 1 (i) ||F ||Xν+1 i Xν i ||F

pν X||Xν Xν 1||F ||Xν+1 Xν||F + qν X||Uν Uν 1||F ||Xν+1 Xν||F ,

where in the last inequality we used P i aibi ||a|| ||b||. Therefore,

||Xν+1 Xν||F pν X||Xν Xν 1||F + qν X||Uν Uν 1||F . (129)

If, in addition, Assumption D2 holds, then it follows from (128) that there exists a suﬃciently large νX > 0 such that pν X (0, δ0), for all ν νX and some δ0 (0, 1).

Decentralized Dictionary Learning Over Time-Varying Digraphs

University of Southern California, Signal and image processing institute. Volume 3: Miscellaneous image database, 1997. Available online: http://sipi.usc.edu/database/ database.php?volume=misc.

M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11): 4311 4322, November 2006.

A. Aravkin, S. Becker, V. Cevher, and P. Olsen. A variational approach to stable principal component pursuit. In Proceedings of the Thirtieth Conference on Uncertainty in Artiﬁcial Intelligence, UAI 14, pages 32 41, Arlington, Virginia, United States, 2014. AUAI Press. ISBN 978-0-9749039-1-0.

D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: numerical methods. Athena Scientiﬁc, 1997.

P. Bianchi and J. Jakubowicz. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Transactions on Automatic Control, 58(2): 391 405, February 2013.

T. Bouwmans, A. Sobral, S. Javed, S. K. Jung, and E. Zahzah. Decomposition into low-rank plus additive matrices for background/foreground separation: A review for a comparative evaluation with a large-scale dataset. Computer Science Review, 23:1 71, February 2017.

E. J. Cand es, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3):1 37, June 2011.

P. Chainais and C. Richard. Learning a common dictionary over a sensor network. In Proceedings of the 2013 5th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 133 136, Saint Martin, French West Indies, France, December 2013.

J. Chen, Z. J. Towﬁc, and A. H. Sayed. Dictionary learning over distributed models. IEEE Transactions on Signal Processing, 63(4):1001 1016, February 2015.

S. Chouvardas, Y. Kopsinis, and S. Theodoridis. An online algorithm for distributed dictionary learning. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3292 3296, Brisbane, Queensland, Australia, April 2015.

A. Daneshmand, F. Facchinei, V. Kungurtsev, and G. Scutari. Hybrid random/deterministic parallel algorithms for convex and nonconvex big data optimization. IEEE Transactions on Signal Processing, 63(13):3914 3929, August 2015.

A. Daneshmand, G. Scutari, and F. Facchinei. Distributed dictionary learning. In Proceedings of the 50th Asilomar Conference on Signals, Systems and Computers, pages 1001 1005, November 2016.

Daneshmand, Sun, Scutari, Facchinei, Sadler

P. Di Lorenzo and G. Scutari. NEXT: In-network nonconvex optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(2):120 136, June 2016.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736 3745, December 2006.

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2): 303 338, June 2010.

F. Facchinei, G. Scutari, and Simone Sagratella. Parallel selective algorithms for nonconvex big data optimization. IEEE Transactions on Signal Processing, 63(7):1874 1889, April 2015.

T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity. CRC Press, Taylor & Francis Group, 2015.

M. Hong, D. Hajinezhad, and M. M. Zhao. Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 1529 1538, Sydney, Australia, August 2017.

P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457 1469, December 2004.

Z. Jiang, Z. Lin, and L. S. Davis. Learning a discriminative dictionary for sparse coding via label consistent k-svd. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 11, pages 1697 1704, Colorado Springs, CO, USA, June 2011.

D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In Proceedins of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 482 491, Cambridge, MA, USA, October 2003.

J. Kim. High-radix Interconnection Networks. Ph D thesis, Stanford, CA, USA, 2008.

D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788 791, October 1999.

M. Lee, H. Shen, J. Z. Huang, and J. S. Marron. Biclustering via sparse singular value decomposition. Biometrics, 66(4):1087 1095, December 2010.

J. Liang, M. Zhang, X. Zeng, and G. Yu. Distributed dictionary learning for sparse representation in sensor networks. IEEE Transactions on Image Processing, 23(6):2528 2541, June 2014.

S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1): 24 45, January 2004.

Decentralized Dictionary Learning Over Time-Varying Digraphs

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 1033 1040, Lake Tahoe, NV, December 2008.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19 60, January 2010.

A. Nedi c, A. Ozdaglar, and P. A. Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922 938, April 2010.

H. Raja and W. U. Bajwa. Cloud K-SVD: Computing data-adaptive representations in the cloud. In Proceedings of 51st Annual Allerton Conference, pages 1474 1481, Allerton House, UIUC, Illinois, USA, October 2013.

M. Razaviyayn, M. Hong, Z. Luo, and J. Pang. Parallel successive convex approximation for nonsmooth nonconvex optimization. In NIPS, 2014a.

M. Razaviyayn, H. W. Tseng, and Z. Q. Luo. Dictionary learning for sparse representation: Complexity and algorithms. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5247 5251, Florence, Italy, May 2014b.

B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471 501, August 2010.

R.T. Rockafellar and J.B. Wets. Variational Analysis. Springer, 1998.

G. Scutari and Y. Sun. Parallel and distributed successive convex approximation methods for big-data optimization. In F. Facchinei and J.-S. Pang, editors, Multi-agent Optimization, chapter 3, pages 141 308. Lecture Notes in Mathematics 2224, Springer, 2018.

G. Scutari and Y. Sun. Distributed nonconvex constrained optimization over time-varying digraphs. Mathematical Programming, to appear 2019.

G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang. Decomposition by partial linearization: Parallel optimization of multiuser systems. IEEE Transaction on Signal Processing, 62:641 656, February 2014.

H. Shen and J. Z. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 6(99):1015 1034, July 2008.

N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. In Proceedings of the Learning Theory: 18th Annual Conference on Learning Theory (COLT), pages 545 560, Bertinoro, Italy, June 2005.

Y. Sun, G. Scutari, and D. Palomar. Distributed nonconvex multiagent optimization over time-varying networks. In Proceedings of the 2016 50th Asilomar Conference on Signals, Systems and Computers, pages 788 794, November 2016.

Daneshmand, Sun, Scutari, Facchinei, Sadler

K. K. Sung. Learning and Example Selection for Object and Pattern Recognition. Ph D thesis, Artiﬁcial Intelligence Laboratory and Center for Biological and Computational Learning, MIT, Cambridge, MA, 1996.

T. Tatarenko and B. Touri. Non-convex distributed optimization. IEEE Trans. on Automatic Control, 62:3744 3757, August 2017.

I. Tosic and P. Frossard. Dictionary learning. IEEE Signal Processing Magazine, 28(2): 27 38, March 2011.

M. Udell, C. Horn, R. Zadeh, and S. Boyd. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1 118, June 2016.

H. T. Wai, T. H. Chang, and A. Scaglione. A consensus-based decentralized algorithm for non-convex optimization with application to dictionary learning. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3546 3550, Brisbane, Queensland, Australia, April 2015.

H. T. Wai, J. Lafond, A. Scaglione, and E. Moulines. Decentralized frankwolfe algorithm for convex and nonconvex problems. IEEE Transaction on Automatic Control, 62:5522 5537, November 2017.

L. Xiao, S. Boyd, and S. Lall. A scheme for robust distributed sensor fusion based on average consensus. In Proceedings of the 4th international symposium on Information processing in sensor networks, pages 63 70, Los Angeles, CA, April 2005.

L. Xiao, S. Boyd, and S. J. Kim. Distributed average consensus with least-mean-square deviation. Journal of Parallel and Distributed Computing, 67(1):33 46, January 2007.

M. Zhao, Q. Shi, and M. Hong. A distributed algorithm for dictionary learning over networks. In Proceedings of 2016 IEEE Global Conference on Signal and Information Processing (Global SIP), pages 505 509, December 2016.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301 320, March 2005.