# optimal_dictionary_for_least_squares_representation__7ce3f568.pdf

Journal of Machine Learning Research 18 (2017) 1-28 Submitted 1/16; Revised 5/17; Published 10/17

Optimal Dictionary for Least Squares Representation

Mohammed Rayyan Sheriﬀ mohammedrayyan@sc.iitb.ac.in Debasish Chatterjee dchatter@iitb.ac.in Systems and Control Engineering IIT Bombay Mumbai 400076, India

Editor: Benjamin Recht

Dictionaries are collections of vectors used for the representation of a class of vectors in Euclidean spaces. Recent research on optimal dictionaries is focused on constructing dictionaries that oﬀer sparse representations, i.e., ℓ0-optimal representations. Here we consider the problem of ﬁnding optimal dictionaries with which representations of a given class of vectors is optimal in an ℓ2-sense: optimality of representation is deﬁned as attaining the minimal average ℓ2-norm of the coeﬃcients used to represent the vectors in the given class. With the help of recent results on rank-1 decompositions of symmetric positive semideﬁnite matrices, we provide an explicit description of ℓ2-optimal dictionaries as well as their algorithmic constructions in polynomial time.

Keywords: ℓ2-optimal dictionary, rank-1 decomposition, ﬁnite tight frames

1. Introduction

A dictionary is a collection of vectors in a ﬁnite-dimensional vector space over R, with which other vectors of the vector space are represented. A dictionary is a generalization of a basis: While the number of vectors in a basis is exactly equal to the dimension of the vector space, a dictionary may contain more elements. In this article we consider a problem of ﬁnding an optimal dictionary, where optimality is interpreted as the minimum expected average size of the coeﬃcients required to represent a certain collection of vectors drawn from a given probability distribution.

We begin with a toy example to motivate the problems treated in this article. Let V be a random vector that attains values close to 0 2 J with high probability; the situation is demonstrated below:

c 2017 Mohammed Rayyan Sheriﬀand Debasish Chatterjee.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-046.html.

Mohammed Rayyan Sheriff and Debasish Chatterjee

Figure 1: Comparison of two dictionaries.

Suppose that our dictionary consists of the vectors d1 1 ϵ J and d2 1 ϵ J

in R2, with a small positive value of ϵ. Since we must represent V using d1 and d2, the corresponding coeﬃcients α1 and α2 must be such that α1 1 ϵ J α2 1 ϵ J V 0 2 J. A quick calculation shows that the magnitudes of the coeﬃcients α1 and α2 should then be approximately equal to 1{pϵq with high probability. To wit, the magnitudes of these coeﬃcients are large for small values of ϵ. It is therefore more appropriate in this situation to consider a dictionary consisting of vectors d 1 ϵ 1 J and d 2 ϵ 1 J to represent the samples of V , in which case, the magnitudes of the coeﬃcients of the representations are closer to 1 with high probability. The latter values are comparatively far smaller compared to the values close to 1{pϵq obtained with the preceding dictionary. This simple example shows that given some statistical information about the random vectors to be represented, the question of designing a dictionary that minimizes the average cost of representation can be better addressed. Let us now turn to a situation in which considering the average cost of representations is natural. Our motivation comes from a control theoretic ideas perspective. Consider a linear time-invariant control system modeled by the recursion

xpt 1q Axptq Buptq, t 0, 1, . . . , (1)

where the system matrix A P Rnˆn and the control matrix B P Rnˆm are given, with the initial boundary condition xp0q x P Rn ﬁxed. For an arbitrarily selected ˆx P Rn, consider the standard reachability problem for (1), that is:

If possible, ﬁnd a sequence puptqqt Ă Rm of control vectors

that steer the system states to ˆx. (2)

A necessary and suﬃcient condition for such a sequence to exist for every pair p x, ˆxq is that the rank of the matrix RKp A, Bq : B AB An 1B is equal to n, which we impose for the moment. Letting K : min k ě 0 ˇˇ rank p RKp A, Bqq n ( denote the reachability

Optimal Dictionary for Least Squares Representation

index of (1), we see at once that the control vectors puptqq K 1 t 0 needed to execute the transfer of the states of (1) from x to ˆx must be a solution to the linear equation

t 0 At Buptq RKp A, Bq

up K 1q ... up1q up0q

It is now natural to consider the control cost of transferring x to ˆx, for which, a natural candidate is the associated ℓ2 performance index řK 1 i 0 uptq 2. Since in practice, the ℓ2 performance index is analogous to the amount of energy spent to control the system, its practical importance can hardly be overstated in the context of control. Let us list three examples: In attitude control/orientation problems of space vehicles, one must execute most of the rapid manoeuvre using the energy from the limited amount of fuel on board, or with the energy available from on-board batteries; minimizing the energy expenditure, therefore, is crucial. In controlled automated mobile robots (e.g., automated cars) designed to reach a given location within a certain time, reduction of energy consumption leads directly to reduction in fuel consumed. In control of electronic systems such as power electronic drives, the associated ℓ2 performance index involves information of the amount of power drawn from the electricity grid to control the system, leading directly to minimization of power consumption and thereby heating. Minimization of control eﬀort has been an integral part of control theory, and is generally studied under the class of Linear Quadratic problems; see, e.g., (Bertsekas, 1995), (Anderson and Moore, 2007), (Clarke, 2013), (Liberzon, 2012), or any standard book on optimal control. It is evident that the task of designing control systems that require minimum control energy for their typical manoeuvres is of great importance. It is a standard practice to study the reachability problem (2), for x 0 and ˆx on the unit sphere; due to linearity of (1), this special case provides suﬃcient insight into the general case. Let us consider the following optimal control problem:

minimize puptqqt E K 1 ÿ

xpt 1q Axptq Buptq for all t 0, . . . , K 1, xp0q 0, xp Kq ˆx distributed according to µ,

where µ is a probability distribution on Rd. It is known that if ˆx is uniformly distributed over the unit sphere, then the optimal control problem (3) admits an unique optimal solution and the optimum value is proportional to tr W 1 A,B , where WA,B : RKp A, Bq RKp A, Bq J

is the controllability grammian of the system; for details see, e.g., (Müller and Weber, 1972) and (Pasqualetti et al., 2014). It can be readily shown that if Σ : Erˆxˆx Js is well deﬁned, then the optimum value of (3) is equal to tr ΣW 1 A,B . Evidently, for a given distribution of

Mohammed Rayyan Sheriff and Debasish Chatterjee

ˆx, diﬀerent linear systems (1) described completely by the pair p A, Bq incur diﬀerent optimum values tr ΣW 1 A,B of (3). Against the above backdrop, consider the question of designing the linear control system (1) such that the value of (3) is as low as possible. Since most control problems involve designing control sequences to execute a class of desired manoeuvres, for a given distribution of ˆx it is then natural to design the linear systems in order to minimize the optimum value tr ΣW 1 A,B of the optimal control problem (3). In this case, the system design problem is similar to the one of ﬁnding an ℓ2-optimal dictionary as described above: here the matrices A and B are to be designed, within a feasible region, such that the column vectors constituting the matrix RKp A, Bq lead to minimal expected average cost of reachability, i.e., minimal value of (3). Such problems routinely arise in networked control, where the pair p A, Bq is a function of the constituent systems and the connectivity of the network. From an operational standpoint, it is good for a networked system to have its components connected in a way such that the resulting system incurs small expected average state transfer costs. Indeed, control systems are typically designed (Müller and Weber, 1972) by optimizing a ﬁgure of merit / measure of quality / measure of controllability; in particular, networked control systems are designed in (Pasqualetti et al., 2014) using a measure of quality deﬁned there. Based on this work on ℓ2-optimal dictionaries, we have proposed a novel measure of quality in (Sheriﬀand Chatterjee, 2017), and further developments for algorithmic synthesis of largescale control systems will be reported elsewhere. Besides these applications in control theory and practice, one of the key objective of our work here is to investigate and understand the physical nature of the ℓ2-optimal dictionaries independent of their connection with control theory. Such a study will shed light on other control theoretic properties of observability and estimation. There has been signiﬁcant recent research into ﬁnding optimal dictionaries, brieﬂy outlined in (Tošić and Frossard, 2011); current research centers around the development of learning algorithms for ﬁnding optimal dictionaries. Much of the thrust is on arriving at dictionaries that oﬀer sparse representations of sample vectors. One of the ﬁrst learning algorithms to develop a dictionary that oﬀers sparse representation of images was given in (Olshausen and Field, 1997). Since then many learning algorithms have been developed to obtain dictionaries that oﬀer sparse representation along with other special properties such as online computation capability (Mairal et al., 2009a), better classiﬁcation property (Mairal et al., 2009b; Yang et al., 2011), better adaptive properties (Skretting and Engan, 2010); several other algorithms are given in (K. Delgado et al., 2003; Yaghoobi et al., 2009; Mallat and Zhang, 1993). The problem addressed in this article diﬀers from the mainstream research of ﬁnding dictionaries oﬀering sparse (ℓ0-optimal) representations in the sense that our objective is to ﬁnd dictionaries that give minimum average ℓ2-norm of the coeﬃcient vector used for representation. Intuitively, optimization of the ℓ2-norm of the representation vector tends to distribute the information of the data being represented among all components of the representation vector; this makes the representation robust to accidental changes in the coeﬃcients. An advantage of considering the ℓ2-cost is that it involves a norm arising from an inner product; consequently, it comes with a rich set of properties associated with it. These properties are crucially employed in this article to modify the intrinsically non-convex

Optimal Dictionary for Least Squares Representation

problem of ﬁnding an ℓ2-optimal dictionary into an equivalent convex optimization problem,1 allowing us to compute an optimal dictionary in polynomial time and arrive at analytical expressions of the optimal costs. We provide these algorithms in Sections 4.1 and 4.3. One more advantage of considering optimization in the ℓ2-sense is related to the fact that the ℓ2-cost involves the natural notion of energy which is extremely important in practice, especially in control theoretic applications. The results presented here also add to the recent developments in the advantages of representing signals/vectors using tight frames for ﬁnite-dimensional Hilbert spaces. This article unveils as follows: In Section 2 we formally introduce our problem of ﬁnding an optimal dictionary which oﬀers least square representation. Section 2 is the heart of this article, where we solve the problem of ﬁnding an ℓ2-optimal dictionary, and arrive at an explicit solution. Algorithms to construct ℓ2-optimal dictionaries are given in Section 4, where we present the proofs of our main results. The case of representing random vectors distributed uniformly on the unit sphere is treated in Subsection 2.4; we demonstrate that the ℓ2-optimal dictionaries in this case are ﬁnite tight frames. The intermediate Section 3 contains results related to rank-1 decomposition of positive semideﬁnite matrices; these constitute essential tools for the solutions of our main results. We conclude in Section 5 with a summary of this work and future directions.

We employ standard notations in this article. As usual, is the standard Euclidean norm. The n ˆ n identity and m ˆ n zero matrices are denoted by In and Omˆn, respectively. For a matrix M we let trp Mq and M denote its trace and Moore-Penrose pseudo-inverse, respectively. The set of n ˆ n symmetric and positive (semi-)deﬁnite matrices with real entries is denoted by Snˆn (Snˆn ), and the set of n ˆ n symmetric matrices with real entries is denoted by Snˆn. For a Borel probability measure µ deﬁned on Rn, we let Eµr s denote the corresponding mathematical expectation. The image of a map f is written as imagepfq. The gradient of a continuously diﬀerentiable function f is denoted by f. For ﬁnite ordered sets A and B, we let A Z B denote the ordered set consisting of the elements (in their order) of A followed by the elements (in their order) of B; for instance, if A p1, 2q and B p 5, 7q, then A Z B p1, 2, 5, 7q. Suppose that A and B are two ordered sets such that B Ă A as sets, then Az B is the ordered sub-collection in A after deleting the elements of the set B. Finally, given an ordered collection of vectors pxiqn i 1 in Rν with ν ě n and equipped with the standard inner product, Ortho pxiqn i 1 gives the result of Gram-Schmidt orthonormalization of the collection pxiqn i 1 considered in the order in which they appear i.e., x1, x2, . . . , xn.

2. The ℓ2-optimal dictionary problem and its solution

Let V denote an Rn-valued random vector deﬁned on some probability space, and having distribution (i.e., Borel probability measure,) µ. We assume that V has ﬁnite variance. Let

1. By equivalence of two optimization problems we mean that an optimal solution to either of the problems can be obtained from an optimal solution to the other problem.

Mohammed Rayyan Sheriff and Debasish Chatterjee

RV denote the support of µ,2 and let XV be the smallest subspace of Rn containing RV . Our goal is to represent the instances/samples of V with the help of a dictionary of vectors:

DK : di P Rn ˇˇ di 1 for i 1, . . . , K ( with a given K ě n,

in an optimal fashion. A representation of an instance v of the random vector V is given by the coeﬃcient vector α pα1 . . . αKq J, such that

i 1 αidi. (4)

A reconstruction of the sample v from the representation α is carried out by taking the linear combination řK i 1 αidi. We deﬁne the cost associated with representing v in terms of the coeﬃcient vector α as řK i 1 α2 i . Since the dictionary vectors tdiu K i 1 must be able to represent any sample of V , the property that spantdiu K i 1 Ą RV is essential. A dictionary DK tdiu K i 1 Ă Rn is said to be feasible if spantdiu K i 1 Ą RV . We denote by DK the set of all feasible dictionaries. For a feasible dictionary DK tdiu K i 1, with m : dim spantdiu K i 1 , and for any v P RV , the linear equation (4) is satisﬁed by inﬁnitely many values of α whenever K ą m. In fact, the solution set of (4) constitutes a p K mq-dimensional aﬃne subspace of RK. Therefore, in order to represent a given v uniquely, one must deﬁne a mechanism of selecting a particular point from this aﬃne subspace, thus making the coeﬃcient vector α pα1 . . . αKq J a function of v. Let f denote such a function; to wit, fpvq : α is the coeﬃcient vector used to represent the sample v. We call such a map RV Q v ÞÝÑ fpvq P RK a scheme of representation. Representation of samples of the random vector V using a dictionary DK and a scheme f is said to be proper if any vector v P RV can be uniquely represented and then exactly reconstructed back. It is clear that for proper representation of V with a dictionary DK consisting of vectors tdiu K i 1, the mapping RV Q v ÞÝÑ fpvq P RK should be an injection that satisﬁes

V d1 d2 d K fp V q µ-almost surely. (5)

A scheme f of representation is said to be feasible if for some feasible dictionary DK : tdiu K i 1 P DK the equality d1 d2 d K fp V q V is satisﬁed almost surely. We denote by F the set of all feasible schemes of representation. Given a scheme f of representation, the (random) cost associated with representing V is given by fp V q 2. The problem of ﬁnding an ℓ2-optimal dictionary can now be posed as:

Find a pair consisting of a dictionary D K P DK and a feasible scheme f of representation such that the average cost Eµ f p V q 2 of representation is minimal.

Here the subscript µ indicates the distribution of random vector V with respect to which the expectation is evaluated. In other words, we have the following optimization problem:

minimize DK,f Eµ fp V q 2

# DK P DK, f P F.

2. Recall (Parthasarathy, 2005, Theorem 2.1, Deﬁnition 2.1, pp. 27-28) that the support of µ is the set of points z P Rn such that the µ-measure of every open neighbourhood of z is positive.

Optimal Dictionary for Least Squares Representation

The problem given in (6) will be referred to as the ℓ2-optimal dictionary problem. It should be noted that the ℓ2-optimal dictionary problem is non-convex due to the constraint that the dictionary vectors tdiu K i 1 of a feasible dictionary must be of unit length. Even if we change this constraint to t di ď 1u from t di 1u, which makes the feasible region of dictionary vectors convex, the set of feasible schemes of representation is not known to be a convex set a priori. In this article we solve the ℓ2-optimal dictionary problem given in (6) in two steps: (Step I) We let XV Rn. (Step II) We let XV be any proper nontrivial subspace of Rn.3

The remainder of this section is devoted to describing Steps I and II by exposing our main results, followed by discussions, a numerical example, and a treatment of the important case of the uniform distribution on the unit sphere of Rn.

2.1 Step I: XV Rn

If XV Rn, a dictionary of vectors DK tdiu K i 1 Ă Rn is feasible if and only if di 1 for all i 1, . . . , K, and spantdiu K i 1 Rn. Thus, the ℓ2-optimization problem (6) reduces to: minimize tdiu K i 1,f Eµ fp V q 2

di 1 for all i 1, . . . , K, spantdiu K i 1 Rn, d1 d2 d K fp V q V µ-almost surely.

Let ΣV : Eµr V V Js. We claim that ΣV is positive deﬁnite. Indeed, if not, then there exists a nonzero vector x P Rn such that x JV 0 almost surely, which contradicts the assumption that XV Rn. Existence and characterization of the optimal solutions to (7) is done by the following:

Theorem 1. Consider the optimization problem (7), and let ΣV : Eµ V V J . (7) admits an optimal solution.

The optimal value corresponding to (7) is

trpΣ1{2 V q 2

Optimal solutions of (7) are characterized by:

Ź a dictionary D K td i u K i 1 that is feasible for (7) and that satisﬁes

i 1 d i d i J M : K

tr Σ1{2 V Σ1{2 V , (8)

and Ź a scheme f D Kpvq : d 1 d 2 d K v. Moreover, all optimal dictionary-scheme pairs can be obtained via the procedure described in Algorithm 2 on p. 22.

3. The trivial case of XV t0u is discarded because then there is nothing to prove; we therefore limit ourselves to nontrivial subspaces of Rn.

Mohammed Rayyan Sheriff and Debasish Chatterjee

2.2 Step II: XV is a strict nontrivial subspace of Rn

Let XV be any proper nontrivial subspace of Rn. In this situation it is reasonable to expect that no optimal dictionary that solves (6) contains elements that do not belong to XV . That this indeed happens is the assertion of the following Lemma, whose proof is provided in Section 4:

Lemma 2. Optimal solutions, if any exists, of problem (6) are such that the optimal dictionary vectors td i u K i 1 satisfy d i P XV for all i 1, . . . , K.

Lemma 2 guarantees that if the problem (6) admits a solution, then the corresponding optimal dictionary vectors must be elements of XV . This means that it is enough to optimize over dictionaries with their elements in XV instead of the whole of Rn. Therefore, the constraint spantdiu K i 1 Ą RV can be equivalently stated as spantdiu K i 1 XV . Let the dimension of XV be m with m ă n, and let B tbium i 1 be a basis for XV . It should be noted that XV imagepΣV q, and therefore, a basis of XV can be obtained by computing a basis of the subspace imagepΣV q. An example of such a basis of XV is the collection of unit eigenvectors of ΣV corresponding to its non-zero eigenvalues. Fix a basis B tbium i 1 of XV . Let B be a matrix containing the vectors tbium i 1 as its columns: B : b1 b2 bm .

If δi is the representation of the dictionary vector di in the basis B, i.e., di Bδi, then the constraints on the family tdiu K i 1 get transformed to the following ones on tδiu K i 1: di 2 1 ñ δJ i BJB δi 1, and spantdiu K i 1 Ą RV ñ spantdiu K i 1 XV ñ spantδiu K i 1 Rm. We deﬁne the random vector

VX : p BJBq 1BJ V.

Then VX is an Rm valued random vector which is the representation of random vector V in the basis B. For every scheme f that is feasible for (6), let us deﬁne an associated scheme for representing samples of the random vector VX by

Rm Q v ÞÝÑ f Xpvq : fp Bvq P RK.

The conditions on feasibility of f in (6) imply that the scheme f X is feasible if for a feasible dictionary of vectors tδiu K i 1, δ1 δ2 δK f Xp VXq VX µ-almost surely.

In other words, in contrast to the problem (6), where the optimization is carried out over vectors in Rn, we can equivalently consider the same problem in Rm, but with the following modiﬁed constraints:

minimize tδiu K i 1,f X Eµ f Xp VXq 2

δJ i BJB δi 1 for all i 1, . . . , K, spantδiu K i 1 Rm, δ1 δ2 δK f Xp VXq VX µ-almost surely.

Optimal Dictionary for Least Squares Representation

In relation to the problem (9) let us deﬁne the following quantities $ &

ΣV : Eµr V V Js

Σ : p BJBq 1{2 BJΣV B p BJBq 1{2

H : K tr Σ1{2 p BJBq 1{2Σ1{2p BJBq 1{2 .

Since the support of VX is m-dimensional, we conclude from previous discussion that ΣVX : Eµ VXV J X is positive deﬁnite. Since Σ p BJBq1{2ΣVXp BJBq1{2, it follows that Σ is positive deﬁnite, which in turn implies that H is positive deﬁnite. To summarize, an ℓ2-optimal dictionary-scheme pair that solves the optimization problem (6) is equivalently obtained from an optimal solution of the problem (9), and is characterized by the following:

Theorem 3. Consider the optimization problem (9). (9) admits an optimal solution.

The optimal value corresponding to (9) is

Optimal solutions of (9) are characterized by:

Ź a dictionary D K tδ i u K i 1 that is feasible for (9) and that satisﬁes

i 1 δ i δ i J H , (11)

and Ź a scheme f Xpuq : δ 1 δ 2 δ K u. Consequently, an optimal solution of the ℓ2-optimal dictionary problem (6) consisting of an ℓ2-optimal dictionary-scheme pair is given by A collection of vectors td i u K i 1 deﬁned as d i : Bδ i for i 1, 2, . . . , K, and the scheme f pvq : d 1 d 2 d K v. Moreover, all optimal dictionary-scheme pairs can be obtained via the procedure given in Algorithm 3 on p. 26.

2.3 Discussion and a numerical example

Remark 4. The problem (6) does not a priori hypothesize an aﬃne/linear structure of candidate schemes. The fact that linear schemes are optimal in (6) is one of the crucial assertions of both Theorem 1 and Theorem 3.

Remark 5. Algorithmic computation of an ℓ2-optimal dictionary relies on the second moment ΣV of the random vector V . Complete knowledge of the distribution µ is, therefore, unnecessary. This is an advantage since in practical situations, learning/estimating ΣV from data is comparatively less demanding than getting a description of the distribution µ itself.

Remark 6. Let M P Snˆn be such that imagep Mq XV , let B tbium i 1 be a basis for XV evaluated as a basis for imagep Mq. Let

B : b1 b2 bm

Mohammed Rayyan Sheriff and Debasish Chatterjee

Σp Mq : p BJBq 1{2 BJMB p BJBq 1{2

tr Σp Mq 1{2 p BJBq 1{2 Σp Mq 1{2p BJBq 1{2 .

Suppose that tdiu K i 1 and fp q are the dictionary and the scheme obtained using the procedure given in Algorithm 3 using M and K as inputs. By simplifying the pseudo-inverse d1 d2 d K in fp q, the average cost Jp Mq of representing V using the scheme fp q turns out to be

Jp Mq Eµ V JBp BJBq 1 Hp Mq 1p BJBq 1BJV ı

tr Hp Mq 1p BJBq 1BJΣV Bp BJBq 1

tr Hp Mq 1p BJBq 1{2 Σ p BJBq 1{2

tr p BJBq 1{2 Hp Mq 1p BJBq 1{2 Σ

K tr Σp Mq 1{2 tr Σp Mq 1{2Σ .

Let S : T P Snˆn ˇˇ imagep Tq XV ( . Since the sequence of maps

S Q T ÞÝÑ Σp Tq P Smˆm ,

Smˆm Q T ÞÝÑ T 1{2 P Smˆm ,

Smˆm Q T ÞÝÑ T 1Σ P Smˆm ,

Smˆm Q T ÞÝÑ trp Tq P R,

are, evidently, continuous, it follows at once that the map S Q M ÞÝÑ Jp Mq P R is also continuous. If pΣV denotes the estimated second moment of V , and the estimation is carried out with a large enough number of samples of V , with probability one we have imageppΣV q XV . Therefore, by continuity of M ÞÝÑ Jp Mq, we see at once that

JppΣV q ÝÝÝÝÝÝÑ pΣV ÝÑΣV JpΣV q

Remark 7. The optimal average cost of representation of a random vector V is inversely proportional to the size K of the optimal dictionary, as is evident from the optimal costs in Theorems 1 and 3. To wit, the optimal average cost of representation decreases monotonically with K, which is expected.

Remark 8. ℓ2-optimal dictionaries for representing a random vector V are also optimal for representing any scalar multiple αV of V for any 0 α P R. Indeed, it is clear that H

deﬁned in (10) is invariant under nonzero scalar multiplications of V . Therefore, ℓ2-optimal dictionaries are also invariant under nonzero scalar multiplications of the random vector V . This fact also follows from the observation made in Remark 4.

Remark 9. An ℓ2-optimal dictionary as characterized by Theorem 3 appears there in the form of what is known as a rank-1 decomposition of the positive deﬁnite matrix H . Elements

Optimal Dictionary for Least Squares Representation

of the theory of rank-1 decompositions of positive deﬁnite matrices is discussed below in Section 3. This particular decomposition plays a crucial rôle in transforming the search space of the ℓ2-optimal dictionary problem (7) from the set of dictionaries to the set of symmetric positive deﬁnite matrices with real entries, and translating the non-convex ℓ2optimal dictionary problem into a tractable convex one.

Remark 10. All ℓ2-optimal dictionaries are unique upto rank-1 decompositions of a unique positive deﬁnite matrix that is obtained from the second moment Er V V Js of the random vector V . That is, for a given random vector whose samples are to be optimally represented, every ℓ2-optimal dictionary is obtained from a rank-1 decomposition of a unique positive deﬁnite matrix.

Remark 11. Looking ahead at Algorithm 3, it becomes evident that non-uniqueness of optimal dictionaries can be attributed to the non-uniqueness in the selection of C in Step 5 of Algorithm 3, and the element of choice associated to the selection of pj and pk in Step 2 of Algorithm 1. The number of optimal solutions may be inﬁnite depending on the distribution of the random vector V . For instance, if V is uniformly distributed over the unit sphere of Rn and K n, then the elements in an ℓ2-optimal dictionary form an orthonormal basis of Rn. (The special case of uniform distribution of V over spheres is discussed in Section 2.4.) Of course, there are inﬁnitely many orthonormal bases of Rn for n ě 2.

Remark 12. From Algorithm 3 on p. 26 we can infer that by calculating the matrix B there, consisting of the eigenvectors of ΣV corresponding to its non-zero eigenvalues, the computations of p BJBq 1{2, Σ1{2, and C in the decomposition given in Step 5 become straightforward. Therefore, the chief computational load in Algorithm 3 consists of eigendecomposition of ΣV and that in Algorithm 1 (in Step 6), both of which can be performed in polynomial time.

Example 1. Let V ˆV1 V2

be a random vector taking values in R2, with V1 and V2 being

independent random variables. Let the density functions of V1 and V2 be

ρV1pvq 2pv 1q1r1,2spvq and ρV2pvq 2p2 vq1r1,2spvq,

respectively. The support of V is, therefore, the square r1, 2s ˆ r1, 2s. Elementary calcu-

lations lead to ΣV : Eρr V V Js ˆ17{6 20{9 20{9 11{6

. We employed the procedure described

in Algorithm 2 for the given matrix ΣV and K 3 in matlab. An optimal dictionary ty 1, y 2, y 3u was obtained, with

y 1 ˆ0.9789 0.2045

, y 2 ˆ0.6792 0.7339

, y 3 ˆ0.5870 0.8096

the optimum value of the objective function was reported to be 1.8930. This collection ty i u3 i 1 of optimal vectors are marked with crosses on the circumference of the unit circle shown in Figure 2. A second optimal dictionary tz 1, z 2, z 3u was obtained, also using Algorithm 2, with dictionary vectors

z 1 ˆ0.4214 0.9069

, z 2 ˆ0.9284 0.3717

, z 3 ˆ0.8513 0.5247

Mohammed Rayyan Sheriff and Debasish Chatterjee

with an identical optimal value as in the former case. The vectors tz i u3 i 1 are marked with dark circles on the circumference of the unit circle in Figure 2.

Figure 2: The two optimal dictionaries in Example 1.

It is expected that the optimal dictionary vectors are concentrated towards the bottom right corner of the support r1, 2s ˆ r1, 2s (the region with strong shading in ﬁgure 2). In the optimal solution tz i u3 i 1, two vectors z 2 and z 3 point to the region where density of V is concentrated the most. Also, for the solution ty i u3 i 1, two vectors y 2 and y 3 are oriented towards the center of the square r1, 2s ˆ r1, 2s, with the remaining vector pointing towards the region of higher density. These results correlate positively with what may be expected out of ℓ2-optimal dictionaries.

2.4 Uniform distribution over the unit sphere

We shall test our results on the important case of µ being the uniform distribution on the unit sphere. Note that due to (rigid) rotational symmetry of the distribution, it follows that rigid rotations of optimal dictionaries in this case are also optimal. Let us consider a dictionary consisting of (unit) vectors that are close to each other, i.e., the inner product between any two elements of the dictionary is close to 1. It is quite evident that such a dictionary is not optimal for representing uniformly distributed samples due to the fact that samples of V that are almost orthogonal to the dictionary vectors carry equal priority as any other vector but require large coeﬃcients for their representation. It is, therefore, more natural to search for dictionaries in which the constituent vectors are maximally spaced out . Several examples of collections of vectors that are maximally spaced out may be found in (Benedetto and Fickus, 2003, Section 4). Collections of vectors that are maximally far apart from each other are known to attain equilibria under the actions of diﬀerent kinds of forces deﬁned and explained in (Benedetto and Fickus, 2003, Section 4) and (Saﬀand Kuijlaars, 1997, p. 6). Such collections of vectors are generalized by the ideas of tight frames as explained in (Benedetto and Fickus, 2003); see also (Christensen, 2016; Daubechies et al., 1986; Benedetto and Fickus, 2003; Zimmermann, 2001) for related information.

Optimal Dictionary for Least Squares Representation

We recall here some standard deﬁnitions for completeness and to provide the necessary substratum for our next result. Let n, K be positive integers such that K ě n. We say that a collection of vectors txiu K i 1 is a frame for Rn if there exist some constants c, C ą 0 such that

i 1 xxi, xy2 ď C x 2 for all x P Rn.

We say that a frame txiu K i 1 Ă Rn is tight if c C. In addition, if txiu K i 1 Ă Rn is a tight frame and xi 1 for all i 1, 2, . . . , K, we say that the collection txiu K i 1 is a c-unit norm tight frame (a c-UNTF). We have the following connection between ℓ2-optimal dictionaries and UNTFs:

Proposition 13. A dictionary DK tdiu K i 1 is optimal for representing samples of a random vector V that is uniformly distributed over the surface of the unit sphere of Rn if and only if the collection tdiu K i 1 of vectors constitute a K

Proof If V is uniformly distributed over the unit sphere, we have ΣV Er V V Js 1

n In. According to Theorem 1 the collection tdiu K i 1 is an optimal dictionary if and only if

i 1 did J i K tr 1 ?n In 1 ?n In K

Since the family tdiu K i 1 must span Rn by deﬁnition, it is a frame. The frame operator for the frame tdiu K i 1 is given by (Benedetto and Fickus, 2003, Section 2)

Rn Q y ÞÝÑ Spyq :

i 1 xdi, yy di ˆ K ÿ

i 1 did J i

where xv, wy v Jw is the standard inner product in Rn. (Benedetto and Fickus, 2003, Theorem 3.1) asserts that a collection of unit norm vectors tdiu K i 1 forms a tight frame in Rn if and only if the collection is a K

n -UNTF. From (Benedetto and Fickus, 2003, Theorem 2.1) it follows that a collection of vectors tdiu K i 1 is a K

n -UNTF if and only if

i 1 did J i K

The assertion follows from (13) and (14).

3. A particular class of rank-1 decompositions of matrices

We collect and establish here some results on the theory of rank-1 decompositions of matrices. While these facts will be needed for our main results, they are also of independent interest. A standard result in matrix theory (Bhatia, 2009, p. 2) states that a symmetric positive semideﬁnite matrix with real entries M P Snˆn , can be decomposed as Y Y J for some

Mohammed Rayyan Sheriff and Debasish Chatterjee

Y P Rnˆr, where r : rankp Mq. Let yi indicate the i th column of the matrix Y . Then the equality M Y Y J is equivalent to

i 1 yiy J i .

More generally for K ě r, let

M : ˆ M Onˆp K rq Op K rqˆn IK r

where O is a zero matrix of order n ˆ p K rq. If we consider the decomposition of M as M Y Y J with Y P Rpn K rqˆK, and indicate by Y the upper n ˆ K matrix block of Y , we get M Y Y J. In other words

i 1 yiy J i . (15)

There are numerous ways of decomposing positive semideﬁnite matrices; some of them are discussed in (Zhang, 2011, Theorem 7.3). The speciality of a particular decomposition lies in the characteristics exhibited by the vectors yi s. A particular rank-1 decomposition which we will use to solve the ℓ2-optimal dictionary problem is the one where for every M P Snˆn and K ě r : rankp Mq there exists a collection of vectors tyiu K i 1 Ă Rn that satisfy

i 1 yiy J i and y J i yi trp Mq

K for all i 1, . . . , K. (16)

We are now in a position to present Algorithm 1 and its associated Theorem 14, whose corollaries will give us the needed rank-1 decomposition of (16). We mention that Algorithm 1 is, in principle, similar to Procedure 1 of (Sturm and Zhang, 2003), and in particular, the assertions of Theorem 14 and its corollaries can be obtained by applying (Sturm and Zhang, 2003, Proposition 3 and Corollary 4) via some straightforward modiﬁcations. However, we provide the complete proofs here for the sake of completeness.

Theorem 14. For any matrix Λ P Rnˆn there exists an orthonormal collection pxiqn i 1 Ă Rn

of vectors satisfying

x J i Λxi trpΛq

n for all i 1, . . . , n,

Moreover, such a collection can be obtained from Algorithm 1.

Proof First we establish that the collection of vectors pxiqn 1 i 1 contained in Sn 1 (recall that Sn 1 is generated in the for loop in the Algorithm 1,) are orthonormal, and satisfy x J i Λxi trpΛq

n for i 1, . . . , n 1. We shall prove this by induction on i. The induction base: For i 1, we have P1 pe1, e2, . . . , enq. Since řn m 1 e J mΛem trpΛq, vectors pj, pk P P1 exist such that p J j Λpj ď trpΛq

n ď p J k Λpk. We solve for θ in the equation

gpj;pkpθq :

p1 θqpj θpkq JΛpp1 θqpj θpk

pp1 θq2 θ2q trpΛq

Optimal Dictionary for Least Squares Representation

Algorithm 1: Calculation of orthonormal bases à la Theorem 14

Input: A matrix Λ P Rnˆn. Output: An orthonormal collection of vectors pxiqn i 1 Ă Rn such that x J i Λxi trpΛq

n for all i 1, . . . , n.

1 Initialize quantities by S0 H, i 1.

2 for i from 1 to pn 1q

3 do S1 i Si 1 Z pe1, e2, . . . , enq. Pi Orthop S1 iqz Si 1. Find pj, pk P Pi such that p J j Λpj ď trpΛq

n ď p J k Λpk. Let Θ P r0, 1s be a solution of the equation (in θ)

p1 θqpj θpk JΛ p1 θqpj θpk trpΛq

Deﬁne xi : p1 Θqpj Θpk pp1 Θq2 Θ2q1{2 .

Deﬁne Si : Si 1 Z pxiq.

4 end for loop

5 S1 n Sn 1 Z pe1, e2, . . . , enq.

6 Output Sn : Orthop S1 nq.

We know that a solution exists in r0, 1s because for θ 0 we have

gpj;pkp0q pp1 θqpj θpkq JΛpp1 θqpj θpkq

pp1 θq2 θ2q

θ 0 p J j Λpj ď trpΛq

for θ 1 we have

gpj;pkp1q pp1 θqpj θpkq JΛpp1 θqpj θpkq

pp1 θq2 θ2q

θ 1 p J k Λpk ě trpΛq

and gpj;pkp q is a continuous function of θ. Let Θ be such a solution. Then, following the notation in Algorithm 1, we have

x1 : p1 Θqpj Θpk a

p1 Θq2 Θ2 .

Since pj, pk are elements of P1, they are orthonormal; therefore,

p1 Θq2 pj 2 Θ2 pk 2

p1 Θq2 Θ2 1,

and since Θ is a solution of equation (17) we have

x J 1 Λx1 trpΛq

Mohammed Rayyan Sheriff and Debasish Chatterjee

Induction hypothesis: Assume that for some i between 1 and n 1 the collection Si pxℓqi ℓ 1 is orthonormal, and satisﬁes

x J ℓΛxℓ trpΛq

n for all ℓ 1, . . . , i.

Induction step: In view of the induction hypothesis, we deﬁne

S1 i 1 : Si Z pe1, e2, . . . , enq px1, x2, . . . xi, e1, e2, . . . , enq,

and compute

Orthop S1 i 1q px1, x2, . . . , xi, p1, p2, . . . , pn iq,

Pi 1 pp1, p2, . . . , pn iq

as in Algorithm 1. Since the collection pxℓqi ℓ 1 Z ppℓqn i ℓ 1 is an orthonormal basis for Rn, we have iÿ

ℓ 1 x J ℓΛxℓ

ℓ 1 p J ℓΛpℓ trpΛq,

leading to n i ÿ

ℓ 1 p J ℓΛpℓ pn iq

Thus, there exist vectors pj, pk P Pi 1 such that p J j Λpj ď trpΛq

n ď p J k Λpk. Let us consider the equation

gpj,pkpθq : pp1 θqpj θpkq JΛpp1 θqpj θpkq

pp1 θq2 θ2q trpΛq

in θ. From arguments given in the case of i 1, we know that a solution Θ of (18) exists on r0, 1s. We deﬁne

xi 1 : p1 Θqpj Θpk a

p1 Θq2 Θ2 .

Since pj, pk are orthogonal to the vectors pxℓqi ℓ 1, so is any linear combination of pj, pk. Therefore, xi 1 is orthogonal to the vectors pxℓqi ℓ 1, which, along with the fact that

p1 Θq2 pj 2 Θ2 pk 2

p1 Θq2 Θ2 1,

makes the collection pxℓqi 1 ℓ 1 orthonormal. Also, since Θ is a solution of (18), we get

x J i 1Λxi 1 trpΛq

Therefore, by mathematical induction, we conclude that the collection pxiqn 1 i 1 contained in Sn 1 has the required properties.

Optimal Dictionary for Least Squares Representation

Finally, in the 4th and 5th steps of Algorithm 1, we get

S1 n px1, x2, . . . , xn 1, e1, e2, . . . , enq,

and Orthop S1 nq px1, x2, . . . , xn 1, xnq.

By construction, pxℓqn ℓ 1 is an orthonormal collection, implying that řn i 1 x J i Λxi trpΛq. In turn, this leads to

i 1 x J i Λxi

i 1 x J i Λxi

Thus, Algorithm 1 yields a collection of orthonormal vectors pxiqn i 1 such that

x J i Λxi trpΛq

n for all i 1, 2, . . . , n,

thereby completing the proof.

Corollary 15 (Rank-1 decomposition). Let X P Snˆn , deﬁne r : rankp Xq, and let T P Snˆn. There exists a collection of vectors txiur i 1 Ă Rn such that

j 1 xjx J j , and x J i Txi 1

r trp XTq for all i 1, . . . , r.

Proof We know (Bhatia, 2009, p. 2) that any symmetric positive semideﬁnite matrix X with real entries and of rank r can be decomposed as CCJ where C P Rnˆr. Let us deﬁne Λ P Rrˆr as Λ : CJTC. According to Theorem 14 a collection of orthonormal vectors tyiur i 1 Ă Rr can be obtained such that

y J i CJT Cyi y J i Λyi trpΛq

We deﬁne a collection txiur i 1 Ă Rr by xi : Cyi for i 1, . . . , r. Then

i 1 xix J i C ˆ rÿ

i 1 yiy J i

CJ CIr CJ X.

Moreover, for every i 1, . . . , r,

x J i Txi y J i CJTCyi trpΛq

r trp CJTCq 1

The assertion follows.

Corollary 15 is generalized slightly by the following one; we shall employ this particular form to solve the ℓ2-optimal dictionary problem in Theorem 1.

Mohammed Rayyan Sheriff and Debasish Chatterjee

Corollary 16. Let M P Snˆn and deﬁne r : rankp Mq. Let A P Snˆn and K ě r be given. There exists a collection of vectors tyiu K i 1 Ă Rn such that

j 1 yjy J j , and y J i Ayi 1

K trp MAq for all i 1, . . . , K. (19)

Proof Let us consider the square matrices X, T of order K n r in Corollary 15 to be

X : ˆ M Onˆp K rq Op K rqˆn IK r

and T : ˆ A Onˆp K rq Op K rqˆn Op K rqˆp K rq

Then rankp Xq K by construction. Therefore, vectors txiu K i 1 Ă Rn K r exist satisfying the properties in Corollary 15. Let us denote Rn Q yi : xi1 . . . xin J for i 1, . . . , K; in other words, yi is the vector formed by the ﬁrst n components of xi. Then

i 1 yiy J i M,

and for any i 1, . . . , K,

y J i Ayi x J i Txi 1

K trp XTq 1

The assertion follows at once.

4. Proofs of Theorem 1, Lemma 2, and Theorem 3

4.1 Proof of Theorem 1

Proof For a given dictionary DK P DK of vectors tdiu K i 1 that is feasible for (7), let us deﬁne a scheme of representation

Rn Q v ÞÝÑ f DKpvq : d1 d2 d K v P RK.

Quite clearly, d1 d2 d K f DKpvq v for any v P Rn by the deﬁnition of the

pseudo-inverse because if spantdiu K i 1 Rn, then d1 d2 d K v solves the equation d1 d2 d K x v. Therefore,

d1 d2 d K f DKp V q V µ-almost surely.

We know that f DKpvq d1 d2 d K v is the solution of the least squares problem

minimize x PRK x 2

subject to d1 d2 d K x v.

Optimal Dictionary for Least Squares Representation

Therefore, for an arbitrary f P F such that d1 d2 d K fpvq v for all v P Rn, we must have f DKpvq 2 ď fpvq 2 for all v P Rn.

Therefore, f DKp V q 2 ď fp V q 2 µ-almost surely,

and hence, Eµ f DKp V q 2 ď Eµ fp V q 2 .

Minimizing over all feasible dictionaries and schemes, we get

inf DKPDK Eµ f DKp V q 2 ď inf DKPDK, f PF Eµ fp V q 2 (20)

The problem on the left-hand side of the inequality (20) is

minimize tdiu K i 1 Eµ f DKp V q 2

# di 1 for all i 1, . . . , K, spantdiu K i 1 Rn.

From (20) we can conclude that the optimal value, if it exists, of problem (7) is bounded below by the optimal value, if it exists, of the one given in (21). Our strategy is to demonstrate that optimization problem (21) admits a solution, and we shall furnish a feasible solution of (7) that achieves a value of the objective function that is equal to the optimal value of the problem (21). This will solve (7). Let D : d1 d2 d K . The objective function in (21) can be computed as

Eµ f DKp V q 2 Eµ D V 2

Eµ V Jp D q JD V

Eµ V J DJp DDJq 1 J DJp DDJq 1 V

Eµ V Jp DDJq 1DDJp DDJq 1V

Eµ V Jp DDJq 1V

Eµ trp V Jp DDJq 1V q

Eµ trp V V Jp DDJq 1q

tr Eµ V V J p DDJq 1 .

Letting ΣV : Eµ V V J and writing DDJ řK i 1 did J i the optimization problem (21) is rephrased as

minimize tdiu K i 1 tr ˆ ΣV

i 1 did J i

# di 1 for all i 1, . . . , K, spantdiu K i 1 Rn.

Mohammed Rayyan Sheriff and Debasish Chatterjee

Let S be the feasible set for the problem in (22). At ﬁrst (22) appears to be non-convex. Let us demonstrate that the objective function of (22) is convex in DDJ. We know that whenever ΣV is a positive deﬁnite matrix,

trpΣV M 1q tr Σ1{2 V M 1Σ1{2 V tr Σ 1{2 V MΣ 1{2 V 1 .

From (Bhatia, 1997, p. 113 and Exercise V.1.15, p. 117) we know that inversion of a matrix is a matrix convex map on the set of positive deﬁnite matrices. Therefore, for any θ P r0, 1s and M1, M2 P Snˆn we have

Σ 1{2 V p1 θq M1 θM2 Σ 1{2 V 1

p1 θq Σ 1{2 V M1Σ 1{2 V θ Σ 1{2 V M2Σ 1{2 V 1

ĺ p1 θq Σ 1{2 V M1Σ 1{2 V 1 θ Σ 1{2 V M2Σ 1{2 V 1 , (23)

where A ĺ B implies that B A is positive semideﬁnite. Since trp q is a linear functional over the set of n ˆ n matrices we have

tr ΣV p1 θq M1 θM2 1 tr Σ 1{2 V p1 θq M1 θM2 Σ 1{2 V 1

ď p1 θq tr Σ 1{2 V M1Σ 1{2 V 1 θ tr Σ 1{2 V M2Σ 1{2 V 1

ď p1 θq trpΣV M 1 1 q θ trpΣV M 1 2 q.

In other words, the function M ÞÝÑ trpΣV M 1q is a convex function on the set of symmetric and positive deﬁnite matrices. Moreover, we know that for a collection tdiu K i 1 that is feasible for (22),

DK Q tdiu K i 1 ÞÝÑ hpd1, . . . , d Kq :

i 1 did J i

maps into the set of positive deﬁnite matrices. Therefore, the objective function in (22) is a convex function on imagephq. This allows us to translate the feasible set of the optimization problem (22) to the set of matrices M formed by all feasible collections tdiu K i 1, i.e., on hp DKq. Let R : M P Snˆn ˇˇ trp Mq K ( . On the one hand, from Corollary 16 with A In, we know that any symmetric and positive deﬁnite matrix M P R can be decomposed as

i 1 did J i with di

K 1 for all i 1, . . . , K.

The fact that M is positive deﬁnite implies that spantdiu K i 1 Rn. Therefore, tdiu K i 1 P DK and M hpd1, . . . , d Kq, which implies that

R Ă hp Sq. (24)

Optimal Dictionary for Least Squares Representation

On the other hand, for any collection of vectors tdiu K i 1 P DK, we have hpd1, . . . , d Kq řK i 1 did J i P Snˆn and tr hpd1, . . . , d Kq řK i 1 d J i di K. Therefore, by deﬁnition of R,

hp Sq Ă R. (25)

From (24) and (25) we conclude that hp DKq R. The optimization problem (22) is, therefore, equivalent to the one where the feasible set is the set of positive deﬁnite matrices with trace K, i.e., from (22),

minimize MPSnˆn tr ΣV M 1

subject to trp Mq K 0. (26)

The optimization problem in (26) is convex since its objective function is convex (as a function of M) and the feasible region is the intersection of a convex cone Snˆn and the aﬃne space M P Rnˆn ˇˇ trp Mq K 0 ( . In the light of (Boyd and Vandenberghe, 2004, p. 244) it follows that (26) can be solved by considering just the ﬁrst order optimality conditions. These ﬁrst order optimality conditions are expressed in terms of a Lagrangian

Lp M, γq : trp M 1ΣV q γ trp Mq K ,

containing a KKT multiplier γ at an optimal point M as

0 MLp M , γq M trp M 1ΣV q γ trp Mq K ˇˇˇˇ M M

p M q 1ΣV p M q 1 J γIn.

But since M , ΣV P Snˆn , by symmetry it follows that p M q 1ΣV p M q 1 γIn, leading to

ΣV γp M q2. (28)

Since ΣV Onˆn, we get γ 0, and write M as

M 1 ?γ Σ1{2 V .

To evaluate γ we use the fact that by construction K trp M q 1 ?γ tr Σ1{2 V , which gives

γ ˆtr Σ1{2 V

In other words, the ﬁnal expression of the optimizer M in the problem (26) is

tr Σ1{2 V Σ1{2 V . (29)

It follows that the optimal value of the problem (26) (and therefore of (22)) is

trpΣ1{2 V q 2

K . Therefore, this value must be a lower bound of the optimal value, if it exists, for the problem (7).

Mohammed Rayyan Sheriff and Debasish Chatterjee

Employing Corollary 16 with A In, we decompose M as

i 1 d i d i J with d i 1 for each i 1, . . . , K. (30)

Let us consider the dictionary D K consisting of the vectors td i u K i 1 obtained above. Since XV Rn, the matrices ΣV , Σ1{2 V , and M are of rank n, and therefore, spantd i u K i 1 Rn. Along with the fact that d i 1, we see that the dictionary D K of vectors td i u K i 1 is feasible for the problem (7). Let us deﬁne the scheme

Rn Q v ÞÝÑ f D Kpvq : d 1 d 2 d K v P RK.

It is evident that this scheme f D K is feasible for (7). But then the objective function in (7)

evaluated at DK D K and f f D K must be equal to

trpΣ1{2 V q 2

K . Since this particular

value is also a lower bound for the optimal value of (7), the problem (7) is solvable. An optimal dictionary-scheme pair is given by $ &

D K td i u K i 1 obtained from the decomposition (30), and

Rn Q v ÞÝÑ f pvq : d 1 d 2 d K v P RK. (31)

The proof is now complete.

We provide the Algorithm 2 that computes optimal dictionary-scheme pairs for the case XV Rn. The inputs to the algorithm are the matrix ΣV and the size K of a dictionary:

Algorithm 2: ℓ2-optimal dictionary for the case XV Rn.

Input: A matrix ΣV P Snˆn and a number K ě n. Output: An ℓ2-optimal dictionary-scheme pair td i u K i 1, f .

1 Deﬁne M1 : K tr Σ1{2 V Σ1{2 V .

2 Deﬁne M2 : ˆ M1 Onˆp K nq Op K nqˆn IK n

, A : ˆ In Onˆp K nq Op K nqˆn Op K nqˆp K nq

3 Compute C P RKˆK such that M2 CCJ.

4 Deﬁne Λ P RKˆK by Λ : CJAC, and apply Algorithm 1 to get a collection of vectors txiu K i 1 Ă RK.

5 Deﬁne the collection tviu K i 1 Ă RK by vi : Cxi for i 1, . . . , K.

6 Deﬁne the ℓ2-optimal dictionary td i u K i 1 Ă Rn such that the jth component of d i is given by d i pjq : vipjq for j 1, . . . , n and for every i 1, . . . , K.

7 Deﬁne the optimal scheme Rn Q v ÞÝÑ f pvq : d 1 d 2 d K v.

4.2 Proof of Lemma 2

Proof We argue by contradiction. Suppose that the assertion of the Lemma is false. If we denote by xi the orthogonal projection of di on XV and by yi the orthogonal projection of

Optimal Dictionary for Least Squares Representation

di on the orthogonal complement of XV , we must have xi ă 1 for at least one value of i. If f is an optimal scheme of representation, feasibility of f gives, for any v P RV ,

i 1 difipvq ˆ K ÿ

i 1 xifipvq ˆ K ÿ

i 1 yifipvq

xifipvq 0. (32)

Fix a unit vector x P XV , and deﬁne a dictionary td i u K k 1 by

xi xi if xi 0,

x otherwise.

Then clearly

spantd i u K i 1 Ą spantxiu K i 1 Ą RV and d i 1 for all i 1, . . . , K.

In other words, the dictionary of vectors td i u K i 1 is feasible for the problem (6). Let us now deﬁne a scheme f by

Rn Q v ÞÝÑ f pvq : diagt x1 , x2 , . . . , x K ufpvq P RK.

For any v P RV , using the dictionary consisting of vectors td i u K i 1 we get

i 1 d i f i pvq

i 1 d i xi fipvq

xi xi xi fipvq v, (33)

where the last equality follows from (32). Thus, f p q along with the dictionary of vectors td i u K i 1 is feasible for problem (6). But for any v P RV we have

i 1 xi 2 fipvq 2 ă

fipvq 2 fpvq 2 ,

where the inequality is due to the fact that xi ă 1 for at least one i. This contradicts the assumption that the pair tdiu K i 1 along with the scheme f is optimal for (6).

4.3 Proof of Theorem 3

Proof The problem (9) is similar to problem (7) except for the ﬁrst constraint. In (7) we optimize over vectors taking values on the surface of the unit sphere, whereas in (9) we optimize over vectors taking values on the surface of the ellipsoid x P Rm ˇˇ x Jp BJBqx 1 ( . Following the arguments in the proof of Theorem 1 till (22), one can conclude that the

Mohammed Rayyan Sheriff and Debasish Chatterjee

optimal value, if it exists, of problem (9) is bounded below by the optimal value, if it exists, of the problem

minimize tδiu K i 1 tr ˆ ΣVX

# δJ i p BJBqδi 1 for all i 1, 2, . . . , K, spantδiu K i 1 Rm,

where ΣVX : Eµ VXV J X p BJBq 1BJ Eµ V V J p BJBq 1BJ J. Let us deﬁne: S to be the feasible region of the problem (34), R : H P Smˆm ˇˇ tr Hp BJBq q K ( , and

the map Rm K Q pδ1, δ2, . . . , δKq ÞÝÑ hpδ1, δ2, . . . , δKq : řK i 1 δiδJ i P Smˆm . From Corollary 16 we see that for every H P R there exists a collection of vectors tδiu K i 1 such that K ÿ

i 1 δiδJ i H and δJ i p BJBqδi tr Hp BJBq

which, along with the fact that rankp Hq m ñ spantδiu K i 1 Rm, imply that

R Ă hp Sq. (35)

Moreover, for any collection tδiu K i 1 P S, we have

tr hpδ1, δ2, . . . , δKqp BJBq

i 1 δJ i p BJBqδi K and hpδ1, δ2, . . . , δKq P Smˆm ,

which implies that hp Sq Ă R. (36)

From (35) and (36) we conclude that R hp Sq. In other words, instead of optimizing over the feasible collection of vectors in S in (34), one can equivalently optimize over the set of symmetric positive deﬁnite matrices in R. This consideration leads us to the problem:

minimize H P Smˆm tr ΣVXH 1

subject to tr Hp BJBq K 0. (37)

Letting M : p BJBq1{2Hp BJBq1{2, we write the optimization problem (37) with M as the variable instead of H. Due to this change of variables, the constraint and the objective function become

tr Hp BJBq tr p BJBq1{2Hp BJBq1{2 trp Mq,

and tr ΣVXH 1 tr ΣVXp BJBq1{2M 1p BJBq1{2

tr p BJBq1{2ΣVXp BJBq1{2M 1

Optimal Dictionary for Least Squares Representation

where Σ : p BJBq1{2ΣVXp BJBq1{2

p BJBq1{2 p BJBq 1BJ Eµ V V J p BJBq 1BJ Jp BJBq1{2

p BJBq 1{2 BJΣV B p BJBq 1{2.

Using (38) we write the problem (37) equivalently as:

minimize MPSnˆn tr ΣM 1

subject to trp Mq K 0. (40)

The problem (40) is identical to (26), which implies that the problem (40) is solvable, and an optimizer is

M : K tr Σ1{2 Σ1{2.

Therefore, the problem (37) is solvable, and an optimizer is

H : p BJBq 1{2M p BJBq 1{2

K tr Σ1{2 p BJBq 1{2Σ1{2p BJBq 1{2 . (41)

From Corollary 16 it follows that there exists a collection tδ i u K i 1 of vectors such that

i 1 δ i δ i J H and δ i J BJB δ i tr H p BJBq

Employing arguments similar to those given in the proof of Theorem 1, we now conclude that the pair the collection of vectors tδ i u K i 1, and the scheme f Xpuq δ 1 δ 2 δ K u, is optimal for the problem (9). Using the optimal solution of (9), we deﬁne a dictionaryscheme pair as: $ &

d i : Bδ i for i 1, . . . , K,

Rn Q v ÞÝÑ f pvq : f X p BJBq 1BJ v δ 1 δ 2 δ K p BJBq 1BJ v.

(42) It is clear that the pair in (42) is feasible for the problem (6), and that the corresponding objective function evaluates to the optimal value of the problem (34). Therefore, along with the assertion of Lemma 2 we can conclude that the problem (6) is solvable, and in fact an

optimal solution is given by (42) with the optimal value of

K . This completes the proof.

As in the case XV Rn, we now provide the Algorithm 3 to obtain an optimal dictionaryscheme pair for the general ℓ2-optimal dictionary problem (6). The algorithm takes the

Mohammed Rayyan Sheriff and Debasish Chatterjee

matrix ΣV and the size of the dictionary K as its inputs. From ΣV we extract a matrix B P Rnˆm containing a set of basis vectors for imagepΣV q in its columns, these vectors form a basis for XV .

Algorithm 3: A procedure to obtain ℓ2-optimal dictionary.

Input: A matrix ΣV P Snˆn and a number K ě m : dimp XV q rankpΣV q. Output: An ℓ2-optimal dictionary-scheme pair ty i u K i 1, f .

1 Compute a basis tbium i 1 for imagepΣV q and deﬁne B : b1 b2 bm .

2 Deﬁne Σ : p BJBq 1{2 BJΣV B p BJBq 1{2.

3 Compute H : K tr Σ1{2 p BJBq 1{2Σ1{2p BJBq 1{2 .

4 Deﬁne M : ˆ H Omˆp K mq Op K mqˆm IK m

, A : ˆ BJB Omˆp K mq Op K mqˆm Op K mqˆp K mq

5 Compute C P RKˆK such that M CCJ.

6 Deﬁne Λ P RKˆK by Λ : CJAC, and apply Algorithm 1 to get a collection of vectors txiu K i 1 Ă RK.

7 Deﬁne the collection tviu K i 1 Ă RK as vi : Cxi for i 1, . . . , K.

8 Deﬁne the collection tδ i u K i 1 Ă Rm such that the jth component of δ i is given by δ i pjq : vipjq for j 1, . . . , m and for every i 1, . . . , K.

9 Deﬁne the ℓ2-optimal dictionary td i u K i 1 Ă Rn as d i : Bδ i for i 1, . . . , K.

10 Deﬁne the optimal scheme Rn Q v ÞÝÑ f pvq : d 1 d 2 d K v P RK.

5. Conclusion and future directions

In this article we have provided an explicit solution of the ℓ2-optimal dictionary problem in the form of a rank-1 decomposition of a speciﬁc positive deﬁnite matrix derived from given data, together with algorithms to compute the corresponding ℓ2-optimal dictionaries. The analysis in this article assumes that the second moment of the random vector whose samples are to be represented is known. An online algorithm which estimates the second moment of the random vector and computes the dictionary vectors in parallel is being developed, and will be reported in subsequent articles.

Acknowledgments

We sincerely thank Prof. V. S. Borkar for his valuable pointers on convexity of the ℓ2-optimal dictionary problem, Prof. K. S. Mallikarjuna Rao for pointing us to the literature on rank-1 decompositions of matrices, and Prof. N. Khaneja for helpful discussions.

B. D. O. Anderson and J. B. Moore. Optimal Control: Linear Quadratic Methods. Courier Corporation, 2007.

Optimal Dictionary for Least Squares Representation

J. J. Benedetto and M. Fickus. Finite Normalized Tight Frames. Advances in Computational Mathematics, 18(2-4):357 385, 2003.

D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientiﬁc Belmont, MA, 1995.

R. Bhatia. Matrix Analysis, volume 169. Springer-Verlag, New York, 1997.

R. Bhatia. Positive Deﬁnite Matrices. Princeton University Press, 2009.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

O. Christensen. An Introduction to Frames and Riesz Bases. Applied and Numerical Harmonic Analysis. Birkhäuser/Springer, [Cham], 2nd edition, 2016.

F. Clarke. Functional Analysis, Calculus of Variations and Optimal Control, volume 264. Springer Science & Business Media, 2013.

I. Daubechies, A. Grossmann, and Y. Meyer. Painless Nonorthogonal Expansions. Journal of Mathematical Physics, 27(5):1271 1283, 1986.

K. K. Delgado, J. F. Murray, B. D. Rao, K. Engan, T. Lee, and T. J. Sejnowski. Dictionary Learning Algorithms for Sparse Representation. Neural Computation, 15(2):349 396, 2003.

D. Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press, 2012.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online Dictionary Learning for Sparse Coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689 696. ACM, 2009a.

J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach. Supervised Dictionary Learning. In Advances in Neural Information Processing Systems, pages 1033 1040, 2009b.

S. G. Mallat and Z. Zhang. Matching Pursuits With Time-Frequency Dictionaries. IEEE Transactions on Signal Processing, 41(12):3397 3415, 1993.

P. C. Müller and H. I. Weber. Analysis and optimization of certain qualities of controllability and observability for linear dynamical systems. Automatica, 8(3):237 246, 1972.

B. A. Olshausen and D. J. Field. Sparse Coding With an Overcomplete Basis Set: A Strategy Employed by v1? Vision Research, 37(23):3311 3325, 1997.

K. R. Parthasarathy. Probability Measures on Metric Spaces. AMS Chelsea Publishing, Providence, RI, 2005. Reprint of the 1967 original.

F. Pasqualetti, S. Zampieri, and F. Bullo. Controllability metrics, limitations and algorithms for complex networks. IEEE Transactions on Control of Network Systems, 1(1):40 52, 2014.

Mohammed Rayyan Sheriff and Debasish Chatterjee

E. B. Saﬀand A. Kuijlaars. Distributing Many Points on a Sphere. The Mathematical Intelligencer, 19(1):5 11, 1997.

M. R. Sheriﬀand D. Chatterjee. On a Frame Theoretic Measure of Quality of LTI Systems. ar Xiv preprint ar Xiv:1703.07539, 2017.

K. Skretting and K. Engan. Recursive Least Squares Dictionary Learning Algorithm. IEEE Transactions on Signal Processing, 58(4):2121 2130, 2010.

J. F. Sturm and S. Zhang. On Cones of Nonnegative Quadratic Functions. Mathematics of Operations Research, 28(2):246 267, 2003.

I. Tošić and P. Frossard. Dictionary Learning. Signal Processing Magazine, IEEE, 28(2): 27 38, 2011.

M. Yaghoobi, T. Blumensath, and M. E. Davies. Dictionary Learning for Sparse Approximations With the Majorization Method. IEEE Transactions on Signal Processing, 57(6): 2178 2191, 2009.

M. Yang, L. Zhang, X. Feng, and D. Zhang. Fisher Discrimination Dictionary Learning for Sparse Representation. In IEEE International Conference on Computer Vision (ICCV), 2011, pages 543 550. IEEE, 2011.

F. Zhang. Matrix Theory: Basic Results and Techniques. Springer Science & Business Media, 2011.

G. Zimmermann. Normalized Tight Frames in Finite Dimensions. In Recent Progress in Multivariate Approximation, pages 249 252. Springer, 2001.