# optimal_dictionary_for_least_squares_representation__7ce3f568.pdf Journal of Machine Learning Research 18 (2017) 1-28 Submitted 1/16; Revised 5/17; Published 10/17 Optimal Dictionary for Least Squares Representation Mohammed Rayyan Sheriff mohammedrayyan@sc.iitb.ac.in Debasish Chatterjee dchatter@iitb.ac.in Systems and Control Engineering IIT Bombay Mumbai 400076, India Editor: Benjamin Recht Dictionaries are collections of vectors used for the representation of a class of vectors in Euclidean spaces. Recent research on optimal dictionaries is focused on constructing dictionaries that offer sparse representations, i.e., ℓ0-optimal representations. Here we consider the problem of finding optimal dictionaries with which representations of a given class of vectors is optimal in an ℓ2-sense: optimality of representation is defined as attaining the minimal average ℓ2-norm of the coefficients used to represent the vectors in the given class. With the help of recent results on rank-1 decompositions of symmetric positive semidefinite matrices, we provide an explicit description of ℓ2-optimal dictionaries as well as their algorithmic constructions in polynomial time. Keywords: ℓ2-optimal dictionary, rank-1 decomposition, finite tight frames 1. Introduction A dictionary is a collection of vectors in a finite-dimensional vector space over R, with which other vectors of the vector space are represented. A dictionary is a generalization of a basis: While the number of vectors in a basis is exactly equal to the dimension of the vector space, a dictionary may contain more elements. In this article we consider a problem of finding an optimal dictionary, where optimality is interpreted as the minimum expected average size of the coefficients required to represent a certain collection of vectors drawn from a given probability distribution. We begin with a toy example to motivate the problems treated in this article. Let V be a random vector that attains values close to 0 2 J with high probability; the situation is demonstrated below: c 2017 Mohammed Rayyan Sheriffand Debasish Chatterjee. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/16-046.html. Mohammed Rayyan Sheriff and Debasish Chatterjee Figure 1: Comparison of two dictionaries. Suppose that our dictionary consists of the vectors d1 1 ϵ J and d2 1 ϵ J in R2, with a small positive value of ϵ. Since we must represent V using d1 and d2, the corresponding coefficients α1 and α2 must be such that α1 1 ϵ J α2 1 ϵ J V 0 2 J. A quick calculation shows that the magnitudes of the coefficients α1 and α2 should then be approximately equal to 1{pϵq with high probability. To wit, the magnitudes of these coefficients are large for small values of ϵ. It is therefore more appropriate in this situation to consider a dictionary consisting of vectors d 1 ϵ 1 J and d 2 ϵ 1 J to represent the samples of V , in which case, the magnitudes of the coefficients of the representations are closer to 1 with high probability. The latter values are comparatively far smaller compared to the values close to 1{pϵq obtained with the preceding dictionary. This simple example shows that given some statistical information about the random vectors to be represented, the question of designing a dictionary that minimizes the average cost of representation can be better addressed. Let us now turn to a situation in which considering the average cost of representations is natural. Our motivation comes from a control theoretic ideas perspective. Consider a linear time-invariant control system modeled by the recursion xpt 1q Axptq Buptq, t 0, 1, . . . , (1) where the system matrix A P Rnˆn and the control matrix B P Rnˆm are given, with the initial boundary condition xp0q x P Rn fixed. For an arbitrarily selected ˆx P Rn, consider the standard reachability problem for (1), that is: If possible, find a sequence puptqqt Ă Rm of control vectors that steer the system states to ˆx. (2) A necessary and sufficient condition for such a sequence to exist for every pair p x, ˆxq is that the rank of the matrix RKp A, Bq : B AB An 1B is equal to n, which we impose for the moment. Letting K : min k ě 0 ˇˇ rank p RKp A, Bqq n ( denote the reachability Optimal Dictionary for Least Squares Representation index of (1), we see at once that the control vectors puptqq K 1 t 0 needed to execute the transfer of the states of (1) from x to ˆx must be a solution to the linear equation t 0 At Buptq RKp A, Bq up K 1q ... up1q up0q It is now natural to consider the control cost of transferring x to ˆx, for which, a natural candidate is the associated ℓ2 performance index řK 1 i 0 uptq 2. Since in practice, the ℓ2 performance index is analogous to the amount of energy spent to control the system, its practical importance can hardly be overstated in the context of control. Let us list three examples: In attitude control/orientation problems of space vehicles, one must execute most of the rapid manoeuvre using the energy from the limited amount of fuel on board, or with the energy available from on-board batteries; minimizing the energy expenditure, therefore, is crucial. In controlled automated mobile robots (e.g., automated cars) designed to reach a given location within a certain time, reduction of energy consumption leads directly to reduction in fuel consumed. In control of electronic systems such as power electronic drives, the associated ℓ2 performance index involves information of the amount of power drawn from the electricity grid to control the system, leading directly to minimization of power consumption and thereby heating. Minimization of control effort has been an integral part of control theory, and is generally studied under the class of Linear Quadratic problems; see, e.g., (Bertsekas, 1995), (Anderson and Moore, 2007), (Clarke, 2013), (Liberzon, 2012), or any standard book on optimal control. It is evident that the task of designing control systems that require minimum control energy for their typical manoeuvres is of great importance. It is a standard practice to study the reachability problem (2), for x 0 and ˆx on the unit sphere; due to linearity of (1), this special case provides sufficient insight into the general case. Let us consider the following optimal control problem: minimize puptqqt E K 1 ÿ xpt 1q Axptq Buptq for all t 0, . . . , K 1, xp0q 0, xp Kq ˆx distributed according to µ, where µ is a probability distribution on Rd. It is known that if ˆx is uniformly distributed over the unit sphere, then the optimal control problem (3) admits an unique optimal solution and the optimum value is proportional to tr W 1 A,B , where WA,B : RKp A, Bq RKp A, Bq J is the controllability grammian of the system; for details see, e.g., (Müller and Weber, 1972) and (Pasqualetti et al., 2014). It can be readily shown that if Σ : Erˆxˆx Js is well defined, then the optimum value of (3) is equal to tr ΣW 1 A,B . Evidently, for a given distribution of Mohammed Rayyan Sheriff and Debasish Chatterjee ˆx, different linear systems (1) described completely by the pair p A, Bq incur different optimum values tr ΣW 1 A,B of (3). Against the above backdrop, consider the question of designing the linear control system (1) such that the value of (3) is as low as possible. Since most control problems involve designing control sequences to execute a class of desired manoeuvres, for a given distribution of ˆx it is then natural to design the linear systems in order to minimize the optimum value tr ΣW 1 A,B of the optimal control problem (3). In this case, the system design problem is similar to the one of finding an ℓ2-optimal dictionary as described above: here the matrices A and B are to be designed, within a feasible region, such that the column vectors constituting the matrix RKp A, Bq lead to minimal expected average cost of reachability, i.e., minimal value of (3). Such problems routinely arise in networked control, where the pair p A, Bq is a function of the constituent systems and the connectivity of the network. From an operational standpoint, it is good for a networked system to have its components connected in a way such that the resulting system incurs small expected average state transfer costs. Indeed, control systems are typically designed (Müller and Weber, 1972) by optimizing a figure of merit / measure of quality / measure of controllability; in particular, networked control systems are designed in (Pasqualetti et al., 2014) using a measure of quality defined there. Based on this work on ℓ2-optimal dictionaries, we have proposed a novel measure of quality in (Sheriffand Chatterjee, 2017), and further developments for algorithmic synthesis of largescale control systems will be reported elsewhere. Besides these applications in control theory and practice, one of the key objective of our work here is to investigate and understand the physical nature of the ℓ2-optimal dictionaries independent of their connection with control theory. Such a study will shed light on other control theoretic properties of observability and estimation. There has been significant recent research into finding optimal dictionaries, briefly outlined in (Tošić and Frossard, 2011); current research centers around the development of learning algorithms for finding optimal dictionaries. Much of the thrust is on arriving at dictionaries that offer sparse representations of sample vectors. One of the first learning algorithms to develop a dictionary that offers sparse representation of images was given in (Olshausen and Field, 1997). Since then many learning algorithms have been developed to obtain dictionaries that offer sparse representation along with other special properties such as online computation capability (Mairal et al., 2009a), better classification property (Mairal et al., 2009b; Yang et al., 2011), better adaptive properties (Skretting and Engan, 2010); several other algorithms are given in (K. Delgado et al., 2003; Yaghoobi et al., 2009; Mallat and Zhang, 1993). The problem addressed in this article differs from the mainstream research of finding dictionaries offering sparse (ℓ0-optimal) representations in the sense that our objective is to find dictionaries that give minimum average ℓ2-norm of the coefficient vector used for representation. Intuitively, optimization of the ℓ2-norm of the representation vector tends to distribute the information of the data being represented among all components of the representation vector; this makes the representation robust to accidental changes in the coefficients. An advantage of considering the ℓ2-cost is that it involves a norm arising from an inner product; consequently, it comes with a rich set of properties associated with it. These properties are crucially employed in this article to modify the intrinsically non-convex Optimal Dictionary for Least Squares Representation problem of finding an ℓ2-optimal dictionary into an equivalent convex optimization problem,1 allowing us to compute an optimal dictionary in polynomial time and arrive at analytical expressions of the optimal costs. We provide these algorithms in Sections 4.1 and 4.3. One more advantage of considering optimization in the ℓ2-sense is related to the fact that the ℓ2-cost involves the natural notion of energy which is extremely important in practice, especially in control theoretic applications. The results presented here also add to the recent developments in the advantages of representing signals/vectors using tight frames for finite-dimensional Hilbert spaces. This article unveils as follows: In Section 2 we formally introduce our problem of finding an optimal dictionary which offers least square representation. Section 2 is the heart of this article, where we solve the problem of finding an ℓ2-optimal dictionary, and arrive at an explicit solution. Algorithms to construct ℓ2-optimal dictionaries are given in Section 4, where we present the proofs of our main results. The case of representing random vectors distributed uniformly on the unit sphere is treated in Subsection 2.4; we demonstrate that the ℓ2-optimal dictionaries in this case are finite tight frames. The intermediate Section 3 contains results related to rank-1 decomposition of positive semidefinite matrices; these constitute essential tools for the solutions of our main results. We conclude in Section 5 with a summary of this work and future directions. We employ standard notations in this article. As usual, is the standard Euclidean norm. The n ˆ n identity and m ˆ n zero matrices are denoted by In and Omˆn, respectively. For a matrix M we let trp Mq and M denote its trace and Moore-Penrose pseudo-inverse, respectively. The set of n ˆ n symmetric and positive (semi-)definite matrices with real entries is denoted by Snˆn (Snˆn ), and the set of n ˆ n symmetric matrices with real entries is denoted by Snˆn. For a Borel probability measure µ defined on Rn, we let Eµr s denote the corresponding mathematical expectation. The image of a map f is written as imagepfq. The gradient of a continuously differentiable function f is denoted by f. For finite ordered sets A and B, we let A Z B denote the ordered set consisting of the elements (in their order) of A followed by the elements (in their order) of B; for instance, if A p1, 2q and B p 5, 7q, then A Z B p1, 2, 5, 7q. Suppose that A and B are two ordered sets such that B Ă A as sets, then Az B is the ordered sub-collection in A after deleting the elements of the set B. Finally, given an ordered collection of vectors pxiqn i 1 in Rν with ν ě n and equipped with the standard inner product, Ortho pxiqn i 1 gives the result of Gram-Schmidt orthonormalization of the collection pxiqn i 1 considered in the order in which they appear i.e., x1, x2, . . . , xn. 2. The ℓ2-optimal dictionary problem and its solution Let V denote an Rn-valued random vector defined on some probability space, and having distribution (i.e., Borel probability measure,) µ. We assume that V has finite variance. Let 1. By equivalence of two optimization problems we mean that an optimal solution to either of the problems can be obtained from an optimal solution to the other problem. Mohammed Rayyan Sheriff and Debasish Chatterjee RV denote the support of µ,2 and let XV be the smallest subspace of Rn containing RV . Our goal is to represent the instances/samples of V with the help of a dictionary of vectors: DK : di P Rn ˇˇ di 1 for i 1, . . . , K ( with a given K ě n, in an optimal fashion. A representation of an instance v of the random vector V is given by the coefficient vector α pα1 . . . αKq J, such that i 1 αidi. (4) A reconstruction of the sample v from the representation α is carried out by taking the linear combination řK i 1 αidi. We define the cost associated with representing v in terms of the coefficient vector α as řK i 1 α2 i . Since the dictionary vectors tdiu K i 1 must be able to represent any sample of V , the property that spantdiu K i 1 Ą RV is essential. A dictionary DK tdiu K i 1 Ă Rn is said to be feasible if spantdiu K i 1 Ą RV . We denote by DK the set of all feasible dictionaries. For a feasible dictionary DK tdiu K i 1, with m : dim spantdiu K i 1 , and for any v P RV , the linear equation (4) is satisfied by infinitely many values of α whenever K ą m. In fact, the solution set of (4) constitutes a p K mq-dimensional affine subspace of RK. Therefore, in order to represent a given v uniquely, one must define a mechanism of selecting a particular point from this affine subspace, thus making the coefficient vector α pα1 . . . αKq J a function of v. Let f denote such a function; to wit, fpvq : α is the coefficient vector used to represent the sample v. We call such a map RV Q v ÞÝÑ fpvq P RK a scheme of representation. Representation of samples of the random vector V using a dictionary DK and a scheme f is said to be proper if any vector v P RV can be uniquely represented and then exactly reconstructed back. It is clear that for proper representation of V with a dictionary DK consisting of vectors tdiu K i 1, the mapping RV Q v ÞÝÑ fpvq P RK should be an injection that satisfies V d1 d2 d K fp V q µ-almost surely. (5) A scheme f of representation is said to be feasible if for some feasible dictionary DK : tdiu K i 1 P DK the equality d1 d2 d K fp V q V is satisfied almost surely. We denote by F the set of all feasible schemes of representation. Given a scheme f of representation, the (random) cost associated with representing V is given by fp V q 2. The problem of finding an ℓ2-optimal dictionary can now be posed as: Find a pair consisting of a dictionary D K P DK and a feasible scheme f of representation such that the average cost Eµ f p V q 2 of representation is minimal. Here the subscript µ indicates the distribution of random vector V with respect to which the expectation is evaluated. In other words, we have the following optimization problem: minimize DK,f Eµ fp V q 2 # DK P DK, f P F. 2. Recall (Parthasarathy, 2005, Theorem 2.1, Definition 2.1, pp. 27-28) that the support of µ is the set of points z P Rn such that the µ-measure of every open neighbourhood of z is positive. Optimal Dictionary for Least Squares Representation The problem given in (6) will be referred to as the ℓ2-optimal dictionary problem. It should be noted that the ℓ2-optimal dictionary problem is non-convex due to the constraint that the dictionary vectors tdiu K i 1 of a feasible dictionary must be of unit length. Even if we change this constraint to t di ď 1u from t di 1u, which makes the feasible region of dictionary vectors convex, the set of feasible schemes of representation is not known to be a convex set a priori. In this article we solve the ℓ2-optimal dictionary problem given in (6) in two steps: (Step I) We let XV Rn. (Step II) We let XV be any proper nontrivial subspace of Rn.3 The remainder of this section is devoted to describing Steps I and II by exposing our main results, followed by discussions, a numerical example, and a treatment of the important case of the uniform distribution on the unit sphere of Rn. 2.1 Step I: XV Rn If XV Rn, a dictionary of vectors DK tdiu K i 1 Ă Rn is feasible if and only if di 1 for all i 1, . . . , K, and spantdiu K i 1 Rn. Thus, the ℓ2-optimization problem (6) reduces to: minimize tdiu K i 1,f Eµ fp V q 2 di 1 for all i 1, . . . , K, spantdiu K i 1 Rn, d1 d2 d K fp V q V µ-almost surely. Let ΣV : Eµr V V Js. We claim that ΣV is positive definite. Indeed, if not, then there exists a nonzero vector x P Rn such that x JV 0 almost surely, which contradicts the assumption that XV Rn. Existence and characterization of the optimal solutions to (7) is done by the following: Theorem 1. Consider the optimization problem (7), and let ΣV : Eµ V V J . (7) admits an optimal solution. The optimal value corresponding to (7) is trpΣ1{2 V q 2 Optimal solutions of (7) are characterized by: Ź a dictionary D K td i u K i 1 that is feasible for (7) and that satisfies i 1 d i d i J M : K tr Σ1{2 V Σ1{2 V , (8) and Ź a scheme f D Kpvq : d 1 d 2 d K v. Moreover, all optimal dictionary-scheme pairs can be obtained via the procedure described in Algorithm 2 on p. 22. 3. The trivial case of XV t0u is discarded because then there is nothing to prove; we therefore limit ourselves to nontrivial subspaces of Rn. Mohammed Rayyan Sheriff and Debasish Chatterjee 2.2 Step II: XV is a strict nontrivial subspace of Rn Let XV be any proper nontrivial subspace of Rn. In this situation it is reasonable to expect that no optimal dictionary that solves (6) contains elements that do not belong to XV . That this indeed happens is the assertion of the following Lemma, whose proof is provided in Section 4: Lemma 2. Optimal solutions, if any exists, of problem (6) are such that the optimal dictionary vectors td i u K i 1 satisfy d i P XV for all i 1, . . . , K. Lemma 2 guarantees that if the problem (6) admits a solution, then the corresponding optimal dictionary vectors must be elements of XV . This means that it is enough to optimize over dictionaries with their elements in XV instead of the whole of Rn. Therefore, the constraint spantdiu K i 1 Ą RV can be equivalently stated as spantdiu K i 1 XV . Let the dimension of XV be m with m ă n, and let B tbium i 1 be a basis for XV . It should be noted that XV imagepΣV q, and therefore, a basis of XV can be obtained by computing a basis of the subspace imagepΣV q. An example of such a basis of XV is the collection of unit eigenvectors of ΣV corresponding to its non-zero eigenvalues. Fix a basis B tbium i 1 of XV . Let B be a matrix containing the vectors tbium i 1 as its columns: B : b1 b2 bm . If δi is the representation of the dictionary vector di in the basis B, i.e., di Bδi, then the constraints on the family tdiu K i 1 get transformed to the following ones on tδiu K i 1: di 2 1 ñ δJ i BJB δi 1, and spantdiu K i 1 Ą RV ñ spantdiu K i 1 XV ñ spantδiu K i 1 Rm. We define the random vector VX : p BJBq 1BJ V. Then VX is an Rm valued random vector which is the representation of random vector V in the basis B. For every scheme f that is feasible for (6), let us define an associated scheme for representing samples of the random vector VX by Rm Q v ÞÝÑ f Xpvq : fp Bvq P RK. The conditions on feasibility of f in (6) imply that the scheme f X is feasible if for a feasible dictionary of vectors tδiu K i 1, δ1 δ2 δK f Xp VXq VX µ-almost surely. In other words, in contrast to the problem (6), where the optimization is carried out over vectors in Rn, we can equivalently consider the same problem in Rm, but with the following modified constraints: minimize tδiu K i 1,f X Eµ f Xp VXq 2 δJ i BJB δi 1 for all i 1, . . . , K, spantδiu K i 1 Rm, δ1 δ2 δK f Xp VXq VX µ-almost surely. Optimal Dictionary for Least Squares Representation In relation to the problem (9) let us define the following quantities $ & ΣV : Eµr V V Js Σ : p BJBq 1{2 BJΣV B p BJBq 1{2 H : K tr Σ1{2 p BJBq 1{2Σ1{2p BJBq 1{2 . Since the support of VX is m-dimensional, we conclude from previous discussion that ΣVX : Eµ VXV J X is positive definite. Since Σ p BJBq1{2ΣVXp BJBq1{2, it follows that Σ is positive definite, which in turn implies that H is positive definite. To summarize, an ℓ2-optimal dictionary-scheme pair that solves the optimization problem (6) is equivalently obtained from an optimal solution of the problem (9), and is characterized by the following: Theorem 3. Consider the optimization problem (9). (9) admits an optimal solution. The optimal value corresponding to (9) is Optimal solutions of (9) are characterized by: Ź a dictionary D K tδ i u K i 1 that is feasible for (9) and that satisfies i 1 δ i δ i J H , (11) and Ź a scheme f Xpuq : δ 1 δ 2 δ K u. Consequently, an optimal solution of the ℓ2-optimal dictionary problem (6) consisting of an ℓ2-optimal dictionary-scheme pair is given by A collection of vectors td i u K i 1 defined as d i : Bδ i for i 1, 2, . . . , K, and the scheme f pvq : d 1 d 2 d K v. Moreover, all optimal dictionary-scheme pairs can be obtained via the procedure given in Algorithm 3 on p. 26. 2.3 Discussion and a numerical example Remark 4. The problem (6) does not a priori hypothesize an affine/linear structure of candidate schemes. The fact that linear schemes are optimal in (6) is one of the crucial assertions of both Theorem 1 and Theorem 3. Remark 5. Algorithmic computation of an ℓ2-optimal dictionary relies on the second moment ΣV of the random vector V . Complete knowledge of the distribution µ is, therefore, unnecessary. This is an advantage since in practical situations, learning/estimating ΣV from data is comparatively less demanding than getting a description of the distribution µ itself. Remark 6. Let M P Snˆn be such that imagep Mq XV , let B tbium i 1 be a basis for XV evaluated as a basis for imagep Mq. Let B : b1 b2 bm Mohammed Rayyan Sheriff and Debasish Chatterjee Σp Mq : p BJBq 1{2 BJMB p BJBq 1{2 tr Σp Mq 1{2 p BJBq 1{2 Σp Mq 1{2p BJBq 1{2 . Suppose that tdiu K i 1 and fp q are the dictionary and the scheme obtained using the procedure given in Algorithm 3 using M and K as inputs. By simplifying the pseudo-inverse d1 d2 d K in fp q, the average cost Jp Mq of representing V using the scheme fp q turns out to be Jp Mq Eµ V JBp BJBq 1 Hp Mq 1p BJBq 1BJV ı tr Hp Mq 1p BJBq 1BJΣV Bp BJBq 1 tr Hp Mq 1p BJBq 1{2 Σ p BJBq 1{2 tr p BJBq 1{2 Hp Mq 1p BJBq 1{2 Σ K tr Σp Mq 1{2 tr Σp Mq 1{2Σ . Let S : T P Snˆn ˇˇ imagep Tq XV ( . Since the sequence of maps S Q T ÞÝÑ Σp Tq P Smˆm , Smˆm Q T ÞÝÑ T 1{2 P Smˆm , Smˆm Q T ÞÝÑ T 1Σ P Smˆm , Smˆm Q T ÞÝÑ trp Tq P R, are, evidently, continuous, it follows at once that the map S Q M ÞÝÑ Jp Mq P R is also continuous. If pΣV denotes the estimated second moment of V , and the estimation is carried out with a large enough number of samples of V , with probability one we have imageppΣV q XV . Therefore, by continuity of M ÞÝÑ Jp Mq, we see at once that JppΣV q ÝÝÝÝÝÝÑ pΣV ÝÑΣV JpΣV q Remark 7. The optimal average cost of representation of a random vector V is inversely proportional to the size K of the optimal dictionary, as is evident from the optimal costs in Theorems 1 and 3. To wit, the optimal average cost of representation decreases monotonically with K, which is expected. Remark 8. ℓ2-optimal dictionaries for representing a random vector V are also optimal for representing any scalar multiple αV of V for any 0 α P R. Indeed, it is clear that H defined in (10) is invariant under nonzero scalar multiplications of V . Therefore, ℓ2-optimal dictionaries are also invariant under nonzero scalar multiplications of the random vector V . This fact also follows from the observation made in Remark 4. Remark 9. An ℓ2-optimal dictionary as characterized by Theorem 3 appears there in the form of what is known as a rank-1 decomposition of the positive definite matrix H . Elements Optimal Dictionary for Least Squares Representation of the theory of rank-1 decompositions of positive definite matrices is discussed below in Section 3. This particular decomposition plays a crucial rôle in transforming the search space of the ℓ2-optimal dictionary problem (7) from the set of dictionaries to the set of symmetric positive definite matrices with real entries, and translating the non-convex ℓ2optimal dictionary problem into a tractable convex one. Remark 10. All ℓ2-optimal dictionaries are unique upto rank-1 decompositions of a unique positive definite matrix that is obtained from the second moment Er V V Js of the random vector V . That is, for a given random vector whose samples are to be optimally represented, every ℓ2-optimal dictionary is obtained from a rank-1 decomposition of a unique positive definite matrix. Remark 11. Looking ahead at Algorithm 3, it becomes evident that non-uniqueness of optimal dictionaries can be attributed to the non-uniqueness in the selection of C in Step 5 of Algorithm 3, and the element of choice associated to the selection of pj and pk in Step 2 of Algorithm 1. The number of optimal solutions may be infinite depending on the distribution of the random vector V . For instance, if V is uniformly distributed over the unit sphere of Rn and K n, then the elements in an ℓ2-optimal dictionary form an orthonormal basis of Rn. (The special case of uniform distribution of V over spheres is discussed in Section 2.4.) Of course, there are infinitely many orthonormal bases of Rn for n ě 2. Remark 12. From Algorithm 3 on p. 26 we can infer that by calculating the matrix B there, consisting of the eigenvectors of ΣV corresponding to its non-zero eigenvalues, the computations of p BJBq 1{2, Σ1{2, and C in the decomposition given in Step 5 become straightforward. Therefore, the chief computational load in Algorithm 3 consists of eigendecomposition of ΣV and that in Algorithm 1 (in Step 6), both of which can be performed in polynomial time. Example 1. Let V ˆV1 V2 be a random vector taking values in R2, with V1 and V2 being independent random variables. Let the density functions of V1 and V2 be ρV1pvq 2pv 1q1r1,2spvq and ρV2pvq 2p2 vq1r1,2spvq, respectively. The support of V is, therefore, the square r1, 2s ˆ r1, 2s. Elementary calcu- lations lead to ΣV : Eρr V V Js ˆ17{6 20{9 20{9 11{6 . We employed the procedure described in Algorithm 2 for the given matrix ΣV and K 3 in matlab. An optimal dictionary ty 1, y 2, y 3u was obtained, with y 1 ˆ0.9789 0.2045 , y 2 ˆ0.6792 0.7339 , y 3 ˆ0.5870 0.8096 the optimum value of the objective function was reported to be 1.8930. This collection ty i u3 i 1 of optimal vectors are marked with crosses on the circumference of the unit circle shown in Figure 2. A second optimal dictionary tz 1, z 2, z 3u was obtained, also using Algorithm 2, with dictionary vectors z 1 ˆ0.4214 0.9069 , z 2 ˆ0.9284 0.3717 , z 3 ˆ0.8513 0.5247 Mohammed Rayyan Sheriff and Debasish Chatterjee with an identical optimal value as in the former case. The vectors tz i u3 i 1 are marked with dark circles on the circumference of the unit circle in Figure 2. Figure 2: The two optimal dictionaries in Example 1. It is expected that the optimal dictionary vectors are concentrated towards the bottom right corner of the support r1, 2s ˆ r1, 2s (the region with strong shading in figure 2). In the optimal solution tz i u3 i 1, two vectors z 2 and z 3 point to the region where density of V is concentrated the most. Also, for the solution ty i u3 i 1, two vectors y 2 and y 3 are oriented towards the center of the square r1, 2s ˆ r1, 2s, with the remaining vector pointing towards the region of higher density. These results correlate positively with what may be expected out of ℓ2-optimal dictionaries. 2.4 Uniform distribution over the unit sphere We shall test our results on the important case of µ being the uniform distribution on the unit sphere. Note that due to (rigid) rotational symmetry of the distribution, it follows that rigid rotations of optimal dictionaries in this case are also optimal. Let us consider a dictionary consisting of (unit) vectors that are close to each other, i.e., the inner product between any two elements of the dictionary is close to 1. It is quite evident that such a dictionary is not optimal for representing uniformly distributed samples due to the fact that samples of V that are almost orthogonal to the dictionary vectors carry equal priority as any other vector but require large coefficients for their representation. It is, therefore, more natural to search for dictionaries in which the constituent vectors are maximally spaced out . Several examples of collections of vectors that are maximally spaced out may be found in (Benedetto and Fickus, 2003, Section 4). Collections of vectors that are maximally far apart from each other are known to attain equilibria under the actions of different kinds of forces defined and explained in (Benedetto and Fickus, 2003, Section 4) and (Saffand Kuijlaars, 1997, p. 6). Such collections of vectors are generalized by the ideas of tight frames as explained in (Benedetto and Fickus, 2003); see also (Christensen, 2016; Daubechies et al., 1986; Benedetto and Fickus, 2003; Zimmermann, 2001) for related information. Optimal Dictionary for Least Squares Representation We recall here some standard definitions for completeness and to provide the necessary substratum for our next result. Let n, K be positive integers such that K ě n. We say that a collection of vectors txiu K i 1 is a frame for Rn if there exist some constants c, C ą 0 such that i 1 xxi, xy2 ď C x 2 for all x P Rn. We say that a frame txiu K i 1 Ă Rn is tight if c C. In addition, if txiu K i 1 Ă Rn is a tight frame and xi 1 for all i 1, 2, . . . , K, we say that the collection txiu K i 1 is a c-unit norm tight frame (a c-UNTF). We have the following connection between ℓ2-optimal dictionaries and UNTFs: Proposition 13. A dictionary DK tdiu K i 1 is optimal for representing samples of a random vector V that is uniformly distributed over the surface of the unit sphere of Rn if and only if the collection tdiu K i 1 of vectors constitute a K Proof If V is uniformly distributed over the unit sphere, we have ΣV Er V V Js 1 n In. According to Theorem 1 the collection tdiu K i 1 is an optimal dictionary if and only if i 1 did J i K tr 1 ?n In 1 ?n In K Since the family tdiu K i 1 must span Rn by definition, it is a frame. The frame operator for the frame tdiu K i 1 is given by (Benedetto and Fickus, 2003, Section 2) Rn Q y ÞÝÑ Spyq : i 1 xdi, yy di ˆ K ÿ i 1 did J i where xv, wy v Jw is the standard inner product in Rn. (Benedetto and Fickus, 2003, Theorem 3.1) asserts that a collection of unit norm vectors tdiu K i 1 forms a tight frame in Rn if and only if the collection is a K n -UNTF. From (Benedetto and Fickus, 2003, Theorem 2.1) it follows that a collection of vectors tdiu K i 1 is a K n -UNTF if and only if i 1 did J i K The assertion follows from (13) and (14). 3. A particular class of rank-1 decompositions of matrices We collect and establish here some results on the theory of rank-1 decompositions of matrices. While these facts will be needed for our main results, they are also of independent interest. A standard result in matrix theory (Bhatia, 2009, p. 2) states that a symmetric positive semidefinite matrix with real entries M P Snˆn , can be decomposed as Y Y J for some Mohammed Rayyan Sheriff and Debasish Chatterjee Y P Rnˆr, where r : rankp Mq. Let yi indicate the i th column of the matrix Y . Then the equality M Y Y J is equivalent to i 1 yiy J i . More generally for K ě r, let M : ˆ M Onˆp K rq Op K rqˆn IK r where O is a zero matrix of order n ˆ p K rq. If we consider the decomposition of M as M Y Y J with Y P Rpn K rqˆK, and indicate by Y the upper n ˆ K matrix block of Y , we get M Y Y J. In other words i 1 yiy J i . (15) There are numerous ways of decomposing positive semidefinite matrices; some of them are discussed in (Zhang, 2011, Theorem 7.3). The speciality of a particular decomposition lies in the characteristics exhibited by the vectors yi s. A particular rank-1 decomposition which we will use to solve the ℓ2-optimal dictionary problem is the one where for every M P Snˆn and K ě r : rankp Mq there exists a collection of vectors tyiu K i 1 Ă Rn that satisfy i 1 yiy J i and y J i yi trp Mq K for all i 1, . . . , K. (16) We are now in a position to present Algorithm 1 and its associated Theorem 14, whose corollaries will give us the needed rank-1 decomposition of (16). We mention that Algorithm 1 is, in principle, similar to Procedure 1 of (Sturm and Zhang, 2003), and in particular, the assertions of Theorem 14 and its corollaries can be obtained by applying (Sturm and Zhang, 2003, Proposition 3 and Corollary 4) via some straightforward modifications. However, we provide the complete proofs here for the sake of completeness. Theorem 14. For any matrix Λ P Rnˆn there exists an orthonormal collection pxiqn i 1 Ă Rn of vectors satisfying x J i Λxi trpΛq n for all i 1, . . . , n, Moreover, such a collection can be obtained from Algorithm 1. Proof First we establish that the collection of vectors pxiqn 1 i 1 contained in Sn 1 (recall that Sn 1 is generated in the for loop in the Algorithm 1,) are orthonormal, and satisfy x J i Λxi trpΛq n for i 1, . . . , n 1. We shall prove this by induction on i. The induction base: For i 1, we have P1 pe1, e2, . . . , enq. Since řn m 1 e J mΛem trpΛq, vectors pj, pk P P1 exist such that p J j Λpj ď trpΛq n ď p J k Λpk. We solve for θ in the equation gpj;pkpθq : p1 θqpj θpkq JΛpp1 θqpj θpk pp1 θq2 θ2q trpΛq Optimal Dictionary for Least Squares Representation Algorithm 1: Calculation of orthonormal bases à la Theorem 14 Input: A matrix Λ P Rnˆn. Output: An orthonormal collection of vectors pxiqn i 1 Ă Rn such that x J i Λxi trpΛq n for all i 1, . . . , n. 1 Initialize quantities by S0 H, i 1. 2 for i from 1 to pn 1q 3 do S1 i Si 1 Z pe1, e2, . . . , enq. Pi Orthop S1 iqz Si 1. Find pj, pk P Pi such that p J j Λpj ď trpΛq n ď p J k Λpk. Let Θ P r0, 1s be a solution of the equation (in θ) p1 θqpj θpk JΛ p1 θqpj θpk trpΛq Define xi : p1 Θqpj Θpk pp1 Θq2 Θ2q1{2 . Define Si : Si 1 Z pxiq. 4 end for loop 5 S1 n Sn 1 Z pe1, e2, . . . , enq. 6 Output Sn : Orthop S1 nq. We know that a solution exists in r0, 1s because for θ 0 we have gpj;pkp0q pp1 θqpj θpkq JΛpp1 θqpj θpkq pp1 θq2 θ2q θ 0 p J j Λpj ď trpΛq for θ 1 we have gpj;pkp1q pp1 θqpj θpkq JΛpp1 θqpj θpkq pp1 θq2 θ2q θ 1 p J k Λpk ě trpΛq and gpj;pkp q is a continuous function of θ. Let Θ be such a solution. Then, following the notation in Algorithm 1, we have x1 : p1 Θqpj Θpk a p1 Θq2 Θ2 . Since pj, pk are elements of P1, they are orthonormal; therefore, p1 Θq2 pj 2 Θ2 pk 2 p1 Θq2 Θ2 1, and since Θ is a solution of equation (17) we have x J 1 Λx1 trpΛq Mohammed Rayyan Sheriff and Debasish Chatterjee Induction hypothesis: Assume that for some i between 1 and n 1 the collection Si pxℓqi ℓ 1 is orthonormal, and satisfies x J ℓΛxℓ trpΛq n for all ℓ 1, . . . , i. Induction step: In view of the induction hypothesis, we define S1 i 1 : Si Z pe1, e2, . . . , enq px1, x2, . . . xi, e1, e2, . . . , enq, and compute Orthop S1 i 1q px1, x2, . . . , xi, p1, p2, . . . , pn iq, Pi 1 pp1, p2, . . . , pn iq as in Algorithm 1. Since the collection pxℓqi ℓ 1 Z ppℓqn i ℓ 1 is an orthonormal basis for Rn, we have iÿ ℓ 1 x J ℓΛxℓ ℓ 1 p J ℓΛpℓ trpΛq, leading to n i ÿ ℓ 1 p J ℓΛpℓ pn iq Thus, there exist vectors pj, pk P Pi 1 such that p J j Λpj ď trpΛq n ď p J k Λpk. Let us consider the equation gpj,pkpθq : pp1 θqpj θpkq JΛpp1 θqpj θpkq pp1 θq2 θ2q trpΛq in θ. From arguments given in the case of i 1, we know that a solution Θ of (18) exists on r0, 1s. We define xi 1 : p1 Θqpj Θpk a p1 Θq2 Θ2 . Since pj, pk are orthogonal to the vectors pxℓqi ℓ 1, so is any linear combination of pj, pk. Therefore, xi 1 is orthogonal to the vectors pxℓqi ℓ 1, which, along with the fact that p1 Θq2 pj 2 Θ2 pk 2 p1 Θq2 Θ2 1, makes the collection pxℓqi 1 ℓ 1 orthonormal. Also, since Θ is a solution of (18), we get x J i 1Λxi 1 trpΛq Therefore, by mathematical induction, we conclude that the collection pxiqn 1 i 1 contained in Sn 1 has the required properties. Optimal Dictionary for Least Squares Representation Finally, in the 4th and 5th steps of Algorithm 1, we get S1 n px1, x2, . . . , xn 1, e1, e2, . . . , enq, and Orthop S1 nq px1, x2, . . . , xn 1, xnq. By construction, pxℓqn ℓ 1 is an orthonormal collection, implying that řn i 1 x J i Λxi trpΛq. In turn, this leads to i 1 x J i Λxi i 1 x J i Λxi Thus, Algorithm 1 yields a collection of orthonormal vectors pxiqn i 1 such that x J i Λxi trpΛq n for all i 1, 2, . . . , n, thereby completing the proof. Corollary 15 (Rank-1 decomposition). Let X P Snˆn , define r : rankp Xq, and let T P Snˆn. There exists a collection of vectors txiur i 1 Ă Rn such that j 1 xjx J j , and x J i Txi 1 r trp XTq for all i 1, . . . , r. Proof We know (Bhatia, 2009, p. 2) that any symmetric positive semidefinite matrix X with real entries and of rank r can be decomposed as CCJ where C P Rnˆr. Let us define Λ P Rrˆr as Λ : CJTC. According to Theorem 14 a collection of orthonormal vectors tyiur i 1 Ă Rr can be obtained such that y J i CJT Cyi y J i Λyi trpΛq We define a collection txiur i 1 Ă Rr by xi : Cyi for i 1, . . . , r. Then i 1 xix J i C ˆ rÿ i 1 yiy J i CJ CIr CJ X. Moreover, for every i 1, . . . , r, x J i Txi y J i CJTCyi trpΛq r trp CJTCq 1 The assertion follows. Corollary 15 is generalized slightly by the following one; we shall employ this particular form to solve the ℓ2-optimal dictionary problem in Theorem 1. Mohammed Rayyan Sheriff and Debasish Chatterjee Corollary 16. Let M P Snˆn and define r : rankp Mq. Let A P Snˆn and K ě r be given. There exists a collection of vectors tyiu K i 1 Ă Rn such that j 1 yjy J j , and y J i Ayi 1 K trp MAq for all i 1, . . . , K. (19) Proof Let us consider the square matrices X, T of order K n r in Corollary 15 to be X : ˆ M Onˆp K rq Op K rqˆn IK r and T : ˆ A Onˆp K rq Op K rqˆn Op K rqˆp K rq Then rankp Xq K by construction. Therefore, vectors txiu K i 1 Ă Rn K r exist satisfying the properties in Corollary 15. Let us denote Rn Q yi : xi1 . . . xin J for i 1, . . . , K; in other words, yi is the vector formed by the first n components of xi. Then i 1 yiy J i M, and for any i 1, . . . , K, y J i Ayi x J i Txi 1 K trp XTq 1 The assertion follows at once. 4. Proofs of Theorem 1, Lemma 2, and Theorem 3 4.1 Proof of Theorem 1 Proof For a given dictionary DK P DK of vectors tdiu K i 1 that is feasible for (7), let us define a scheme of representation Rn Q v ÞÝÑ f DKpvq : d1 d2 d K v P RK. Quite clearly, d1 d2 d K f DKpvq v for any v P Rn by the definition of the pseudo-inverse because if spantdiu K i 1 Rn, then d1 d2 d K v solves the equation d1 d2 d K x v. Therefore, d1 d2 d K f DKp V q V µ-almost surely. We know that f DKpvq d1 d2 d K v is the solution of the least squares problem minimize x PRK x 2 subject to d1 d2 d K x v. Optimal Dictionary for Least Squares Representation Therefore, for an arbitrary f P F such that d1 d2 d K fpvq v for all v P Rn, we must have f DKpvq 2 ď fpvq 2 for all v P Rn. Therefore, f DKp V q 2 ď fp V q 2 µ-almost surely, and hence, Eµ f DKp V q 2 ď Eµ fp V q 2 . Minimizing over all feasible dictionaries and schemes, we get inf DKPDK Eµ f DKp V q 2 ď inf DKPDK, f PF Eµ fp V q 2 (20) The problem on the left-hand side of the inequality (20) is minimize tdiu K i 1 Eµ f DKp V q 2 # di 1 for all i 1, . . . , K, spantdiu K i 1 Rn. From (20) we can conclude that the optimal value, if it exists, of problem (7) is bounded below by the optimal value, if it exists, of the one given in (21). Our strategy is to demonstrate that optimization problem (21) admits a solution, and we shall furnish a feasible solution of (7) that achieves a value of the objective function that is equal to the optimal value of the problem (21). This will solve (7). Let D : d1 d2 d K . The objective function in (21) can be computed as Eµ f DKp V q 2 Eµ D V 2 Eµ V Jp D q JD V Eµ V J DJp DDJq 1 J DJp DDJq 1 V Eµ V Jp DDJq 1DDJp DDJq 1V Eµ V Jp DDJq 1V Eµ trp V Jp DDJq 1V q Eµ trp V V Jp DDJq 1q tr Eµ V V J p DDJq 1 . Letting ΣV : Eµ V V J and writing DDJ řK i 1 did J i the optimization problem (21) is rephrased as minimize tdiu K i 1 tr ˆ ΣV i 1 did J i # di 1 for all i 1, . . . , K, spantdiu K i 1 Rn. Mohammed Rayyan Sheriff and Debasish Chatterjee Let S be the feasible set for the problem in (22). At first (22) appears to be non-convex. Let us demonstrate that the objective function of (22) is convex in DDJ. We know that whenever ΣV is a positive definite matrix, trpΣV M 1q tr Σ1{2 V M 1Σ1{2 V tr Σ 1{2 V MΣ 1{2 V 1 . From (Bhatia, 1997, p. 113 and Exercise V.1.15, p. 117) we know that inversion of a matrix is a matrix convex map on the set of positive definite matrices. Therefore, for any θ P r0, 1s and M1, M2 P Snˆn we have Σ 1{2 V p1 θq M1 θM2 Σ 1{2 V 1 p1 θq Σ 1{2 V M1Σ 1{2 V θ Σ 1{2 V M2Σ 1{2 V 1 ĺ p1 θq Σ 1{2 V M1Σ 1{2 V 1 θ Σ 1{2 V M2Σ 1{2 V 1 , (23) where A ĺ B implies that B A is positive semidefinite. Since trp q is a linear functional over the set of n ˆ n matrices we have tr ΣV p1 θq M1 θM2 1 tr Σ 1{2 V p1 θq M1 θM2 Σ 1{2 V 1 ď p1 θq tr Σ 1{2 V M1Σ 1{2 V 1 θ tr Σ 1{2 V M2Σ 1{2 V 1 ď p1 θq trpΣV M 1 1 q θ trpΣV M 1 2 q. In other words, the function M ÞÝÑ trpΣV M 1q is a convex function on the set of symmetric and positive definite matrices. Moreover, we know that for a collection tdiu K i 1 that is feasible for (22), DK Q tdiu K i 1 ÞÝÑ hpd1, . . . , d Kq : i 1 did J i maps into the set of positive definite matrices. Therefore, the objective function in (22) is a convex function on imagephq. This allows us to translate the feasible set of the optimization problem (22) to the set of matrices M formed by all feasible collections tdiu K i 1, i.e., on hp DKq. Let R : M P Snˆn ˇˇ trp Mq K ( . On the one hand, from Corollary 16 with A In, we know that any symmetric and positive definite matrix M P R can be decomposed as i 1 did J i with di K 1 for all i 1, . . . , K. The fact that M is positive definite implies that spantdiu K i 1 Rn. Therefore, tdiu K i 1 P DK and M hpd1, . . . , d Kq, which implies that R Ă hp Sq. (24) Optimal Dictionary for Least Squares Representation On the other hand, for any collection of vectors tdiu K i 1 P DK, we have hpd1, . . . , d Kq řK i 1 did J i P Snˆn and tr hpd1, . . . , d Kq řK i 1 d J i di K. Therefore, by definition of R, hp Sq Ă R. (25) From (24) and (25) we conclude that hp DKq R. The optimization problem (22) is, therefore, equivalent to the one where the feasible set is the set of positive definite matrices with trace K, i.e., from (22), minimize MPSnˆn tr ΣV M 1 subject to trp Mq K 0. (26) The optimization problem in (26) is convex since its objective function is convex (as a function of M) and the feasible region is the intersection of a convex cone Snˆn and the affine space M P Rnˆn ˇˇ trp Mq K 0 ( . In the light of (Boyd and Vandenberghe, 2004, p. 244) it follows that (26) can be solved by considering just the first order optimality conditions. These first order optimality conditions are expressed in terms of a Lagrangian Lp M, γq : trp M 1ΣV q γ trp Mq K , containing a KKT multiplier γ at an optimal point M as 0 MLp M , γq M trp M 1ΣV q γ trp Mq K ˇˇˇˇ M M p M q 1ΣV p M q 1 J γIn. But since M , ΣV P Snˆn , by symmetry it follows that p M q 1ΣV p M q 1 γIn, leading to ΣV γp M q2. (28) Since ΣV Onˆn, we get γ 0, and write M as M 1 ?γ Σ1{2 V . To evaluate γ we use the fact that by construction K trp M q 1 ?γ tr Σ1{2 V , which gives γ ˆtr Σ1{2 V In other words, the final expression of the optimizer M in the problem (26) is tr Σ1{2 V Σ1{2 V . (29) It follows that the optimal value of the problem (26) (and therefore of (22)) is trpΣ1{2 V q 2 K . Therefore, this value must be a lower bound of the optimal value, if it exists, for the problem (7). Mohammed Rayyan Sheriff and Debasish Chatterjee Employing Corollary 16 with A In, we decompose M as i 1 d i d i J with d i 1 for each i 1, . . . , K. (30) Let us consider the dictionary D K consisting of the vectors td i u K i 1 obtained above. Since XV Rn, the matrices ΣV , Σ1{2 V , and M are of rank n, and therefore, spantd i u K i 1 Rn. Along with the fact that d i 1, we see that the dictionary D K of vectors td i u K i 1 is feasible for the problem (7). Let us define the scheme Rn Q v ÞÝÑ f D Kpvq : d 1 d 2 d K v P RK. It is evident that this scheme f D K is feasible for (7). But then the objective function in (7) evaluated at DK D K and f f D K must be equal to trpΣ1{2 V q 2 K . Since this particular value is also a lower bound for the optimal value of (7), the problem (7) is solvable. An optimal dictionary-scheme pair is given by $ & D K td i u K i 1 obtained from the decomposition (30), and Rn Q v ÞÝÑ f pvq : d 1 d 2 d K v P RK. (31) The proof is now complete. We provide the Algorithm 2 that computes optimal dictionary-scheme pairs for the case XV Rn. The inputs to the algorithm are the matrix ΣV and the size K of a dictionary: Algorithm 2: ℓ2-optimal dictionary for the case XV Rn. Input: A matrix ΣV P Snˆn and a number K ě n. Output: An ℓ2-optimal dictionary-scheme pair td i u K i 1, f . 1 Define M1 : K tr Σ1{2 V Σ1{2 V . 2 Define M2 : ˆ M1 Onˆp K nq Op K nqˆn IK n , A : ˆ In Onˆp K nq Op K nqˆn Op K nqˆp K nq 3 Compute C P RKˆK such that M2 CCJ. 4 Define Λ P RKˆK by Λ : CJAC, and apply Algorithm 1 to get a collection of vectors txiu K i 1 Ă RK. 5 Define the collection tviu K i 1 Ă RK by vi : Cxi for i 1, . . . , K. 6 Define the ℓ2-optimal dictionary td i u K i 1 Ă Rn such that the jth component of d i is given by d i pjq : vipjq for j 1, . . . , n and for every i 1, . . . , K. 7 Define the optimal scheme Rn Q v ÞÝÑ f pvq : d 1 d 2 d K v. 4.2 Proof of Lemma 2 Proof We argue by contradiction. Suppose that the assertion of the Lemma is false. If we denote by xi the orthogonal projection of di on XV and by yi the orthogonal projection of Optimal Dictionary for Least Squares Representation di on the orthogonal complement of XV , we must have xi ă 1 for at least one value of i. If f is an optimal scheme of representation, feasibility of f gives, for any v P RV , i 1 difipvq ˆ K ÿ i 1 xifipvq ˆ K ÿ i 1 yifipvq xifipvq 0. (32) Fix a unit vector x P XV , and define a dictionary td i u K k 1 by xi xi if xi 0, x otherwise. Then clearly spantd i u K i 1 Ą spantxiu K i 1 Ą RV and d i 1 for all i 1, . . . , K. In other words, the dictionary of vectors td i u K i 1 is feasible for the problem (6). Let us now define a scheme f by Rn Q v ÞÝÑ f pvq : diagt x1 , x2 , . . . , x K ufpvq P RK. For any v P RV , using the dictionary consisting of vectors td i u K i 1 we get i 1 d i f i pvq i 1 d i xi fipvq xi xi xi fipvq v, (33) where the last equality follows from (32). Thus, f p q along with the dictionary of vectors td i u K i 1 is feasible for problem (6). But for any v P RV we have i 1 xi 2 fipvq 2 ă fipvq 2 fpvq 2 , where the inequality is due to the fact that xi ă 1 for at least one i. This contradicts the assumption that the pair tdiu K i 1 along with the scheme f is optimal for (6). 4.3 Proof of Theorem 3 Proof The problem (9) is similar to problem (7) except for the first constraint. In (7) we optimize over vectors taking values on the surface of the unit sphere, whereas in (9) we optimize over vectors taking values on the surface of the ellipsoid x P Rm ˇˇ x Jp BJBqx 1 ( . Following the arguments in the proof of Theorem 1 till (22), one can conclude that the Mohammed Rayyan Sheriff and Debasish Chatterjee optimal value, if it exists, of problem (9) is bounded below by the optimal value, if it exists, of the problem minimize tδiu K i 1 tr ˆ ΣVX # δJ i p BJBqδi 1 for all i 1, 2, . . . , K, spantδiu K i 1 Rm, where ΣVX : Eµ VXV J X p BJBq 1BJ Eµ V V J p BJBq 1BJ J. Let us define: S to be the feasible region of the problem (34), R : H P Smˆm ˇˇ tr Hp BJBq q K ( , and the map Rm K Q pδ1, δ2, . . . , δKq ÞÝÑ hpδ1, δ2, . . . , δKq : řK i 1 δiδJ i P Smˆm . From Corollary 16 we see that for every H P R there exists a collection of vectors tδiu K i 1 such that K ÿ i 1 δiδJ i H and δJ i p BJBqδi tr Hp BJBq which, along with the fact that rankp Hq m ñ spantδiu K i 1 Rm, imply that R Ă hp Sq. (35) Moreover, for any collection tδiu K i 1 P S, we have tr hpδ1, δ2, . . . , δKqp BJBq i 1 δJ i p BJBqδi K and hpδ1, δ2, . . . , δKq P Smˆm , which implies that hp Sq Ă R. (36) From (35) and (36) we conclude that R hp Sq. In other words, instead of optimizing over the feasible collection of vectors in S in (34), one can equivalently optimize over the set of symmetric positive definite matrices in R. This consideration leads us to the problem: minimize H P Smˆm tr ΣVXH 1 subject to tr Hp BJBq K 0. (37) Letting M : p BJBq1{2Hp BJBq1{2, we write the optimization problem (37) with M as the variable instead of H. Due to this change of variables, the constraint and the objective function become tr Hp BJBq tr p BJBq1{2Hp BJBq1{2 trp Mq, and tr ΣVXH 1 tr ΣVXp BJBq1{2M 1p BJBq1{2 tr p BJBq1{2ΣVXp BJBq1{2M 1 Optimal Dictionary for Least Squares Representation where Σ : p BJBq1{2ΣVXp BJBq1{2 p BJBq1{2 p BJBq 1BJ Eµ V V J p BJBq 1BJ Jp BJBq1{2 p BJBq 1{2 BJΣV B p BJBq 1{2. Using (38) we write the problem (37) equivalently as: minimize MPSnˆn tr ΣM 1 subject to trp Mq K 0. (40) The problem (40) is identical to (26), which implies that the problem (40) is solvable, and an optimizer is M : K tr Σ1{2 Σ1{2. Therefore, the problem (37) is solvable, and an optimizer is H : p BJBq 1{2M p BJBq 1{2 K tr Σ1{2 p BJBq 1{2Σ1{2p BJBq 1{2 . (41) From Corollary 16 it follows that there exists a collection tδ i u K i 1 of vectors such that i 1 δ i δ i J H and δ i J BJB δ i tr H p BJBq Employing arguments similar to those given in the proof of Theorem 1, we now conclude that the pair the collection of vectors tδ i u K i 1, and the scheme f Xpuq δ 1 δ 2 δ K u, is optimal for the problem (9). Using the optimal solution of (9), we define a dictionaryscheme pair as: $ & d i : Bδ i for i 1, . . . , K, Rn Q v ÞÝÑ f pvq : f X p BJBq 1BJ v δ 1 δ 2 δ K p BJBq 1BJ v. (42) It is clear that the pair in (42) is feasible for the problem (6), and that the corresponding objective function evaluates to the optimal value of the problem (34). Therefore, along with the assertion of Lemma 2 we can conclude that the problem (6) is solvable, and in fact an optimal solution is given by (42) with the optimal value of K . This completes the proof. As in the case XV Rn, we now provide the Algorithm 3 to obtain an optimal dictionaryscheme pair for the general ℓ2-optimal dictionary problem (6). The algorithm takes the Mohammed Rayyan Sheriff and Debasish Chatterjee matrix ΣV and the size of the dictionary K as its inputs. From ΣV we extract a matrix B P Rnˆm containing a set of basis vectors for imagepΣV q in its columns, these vectors form a basis for XV . Algorithm 3: A procedure to obtain ℓ2-optimal dictionary. Input: A matrix ΣV P Snˆn and a number K ě m : dimp XV q rankpΣV q. Output: An ℓ2-optimal dictionary-scheme pair ty i u K i 1, f . 1 Compute a basis tbium i 1 for imagepΣV q and define B : b1 b2 bm . 2 Define Σ : p BJBq 1{2 BJΣV B p BJBq 1{2. 3 Compute H : K tr Σ1{2 p BJBq 1{2Σ1{2p BJBq 1{2 . 4 Define M : ˆ H Omˆp K mq Op K mqˆm IK m , A : ˆ BJB Omˆp K mq Op K mqˆm Op K mqˆp K mq 5 Compute C P RKˆK such that M CCJ. 6 Define Λ P RKˆK by Λ : CJAC, and apply Algorithm 1 to get a collection of vectors txiu K i 1 Ă RK. 7 Define the collection tviu K i 1 Ă RK as vi : Cxi for i 1, . . . , K. 8 Define the collection tδ i u K i 1 Ă Rm such that the jth component of δ i is given by δ i pjq : vipjq for j 1, . . . , m and for every i 1, . . . , K. 9 Define the ℓ2-optimal dictionary td i u K i 1 Ă Rn as d i : Bδ i for i 1, . . . , K. 10 Define the optimal scheme Rn Q v ÞÝÑ f pvq : d 1 d 2 d K v P RK. 5. Conclusion and future directions In this article we have provided an explicit solution of the ℓ2-optimal dictionary problem in the form of a rank-1 decomposition of a specific positive definite matrix derived from given data, together with algorithms to compute the corresponding ℓ2-optimal dictionaries. The analysis in this article assumes that the second moment of the random vector whose samples are to be represented is known. An online algorithm which estimates the second moment of the random vector and computes the dictionary vectors in parallel is being developed, and will be reported in subsequent articles. Acknowledgments We sincerely thank Prof. V. S. Borkar for his valuable pointers on convexity of the ℓ2-optimal dictionary problem, Prof. K. S. Mallikarjuna Rao for pointing us to the literature on rank-1 decompositions of matrices, and Prof. N. Khaneja for helpful discussions. B. D. O. Anderson and J. B. Moore. Optimal Control: Linear Quadratic Methods. Courier Corporation, 2007. Optimal Dictionary for Least Squares Representation J. J. Benedetto and M. Fickus. Finite Normalized Tight Frames. Advances in Computational Mathematics, 18(2-4):357 385, 2003. D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific Belmont, MA, 1995. R. Bhatia. Matrix Analysis, volume 169. Springer-Verlag, New York, 1997. R. Bhatia. Positive Definite Matrices. Princeton University Press, 2009. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. O. Christensen. An Introduction to Frames and Riesz Bases. Applied and Numerical Harmonic Analysis. Birkhäuser/Springer, [Cham], 2nd edition, 2016. F. Clarke. Functional Analysis, Calculus of Variations and Optimal Control, volume 264. Springer Science & Business Media, 2013. I. Daubechies, A. Grossmann, and Y. Meyer. Painless Nonorthogonal Expansions. Journal of Mathematical Physics, 27(5):1271 1283, 1986. K. K. Delgado, J. F. Murray, B. D. Rao, K. Engan, T. Lee, and T. J. Sejnowski. Dictionary Learning Algorithms for Sparse Representation. Neural Computation, 15(2):349 396, 2003. D. Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press, 2012. J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online Dictionary Learning for Sparse Coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689 696. ACM, 2009a. J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach. Supervised Dictionary Learning. In Advances in Neural Information Processing Systems, pages 1033 1040, 2009b. S. G. Mallat and Z. Zhang. Matching Pursuits With Time-Frequency Dictionaries. IEEE Transactions on Signal Processing, 41(12):3397 3415, 1993. P. C. Müller and H. I. Weber. Analysis and optimization of certain qualities of controllability and observability for linear dynamical systems. Automatica, 8(3):237 246, 1972. B. A. Olshausen and D. J. Field. Sparse Coding With an Overcomplete Basis Set: A Strategy Employed by v1? Vision Research, 37(23):3311 3325, 1997. K. R. Parthasarathy. Probability Measures on Metric Spaces. AMS Chelsea Publishing, Providence, RI, 2005. Reprint of the 1967 original. F. Pasqualetti, S. Zampieri, and F. Bullo. Controllability metrics, limitations and algorithms for complex networks. IEEE Transactions on Control of Network Systems, 1(1):40 52, 2014. Mohammed Rayyan Sheriff and Debasish Chatterjee E. B. Saffand A. Kuijlaars. Distributing Many Points on a Sphere. The Mathematical Intelligencer, 19(1):5 11, 1997. M. R. Sheriffand D. Chatterjee. On a Frame Theoretic Measure of Quality of LTI Systems. ar Xiv preprint ar Xiv:1703.07539, 2017. K. Skretting and K. Engan. Recursive Least Squares Dictionary Learning Algorithm. IEEE Transactions on Signal Processing, 58(4):2121 2130, 2010. J. F. Sturm and S. Zhang. On Cones of Nonnegative Quadratic Functions. Mathematics of Operations Research, 28(2):246 267, 2003. I. Tošić and P. Frossard. Dictionary Learning. Signal Processing Magazine, IEEE, 28(2): 27 38, 2011. M. Yaghoobi, T. Blumensath, and M. E. Davies. Dictionary Learning for Sparse Approximations With the Majorization Method. IEEE Transactions on Signal Processing, 57(6): 2178 2191, 2009. M. Yang, L. Zhang, X. Feng, and D. Zhang. Fisher Discrimination Dictionary Learning for Sparse Representation. In IEEE International Conference on Computer Vision (ICCV), 2011, pages 543 550. IEEE, 2011. F. Zhang. Matrix Theory: Basic Results and Techniques. Springer Science & Business Media, 2011. G. Zimmermann. Normalized Tight Frames in Finite Dimensions. In Recent Progress in Multivariate Approximation, pages 249 252. Springer, 2001.