# Approximate Supermodularity Bounds for Experimental Design

Luiz F. O. Chamon and Alejandro Ribeiro
Electrical and Systems Engineering, University of Pennsylvania
{luizf,aribeiro}@seas.upenn.edu

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

This work provides performance guarantees for the greedy solution of experimental design problems. In particular, it focuses on A- and E-optimal designs, for which typical guarantees do not apply since the mean-square error and the maximum eigenvalue of the estimation error covariance matrix are not supermodular. To do so, it leverages the concept of approximate supermodularity to derive nonasymptotic worst-case suboptimality bounds for these greedy solutions. These bounds reveal that as the SNR of the experiments decreases, these cost functions behave increasingly as supermodular functions. As such, greedy A- and E-optimal designs approach $(1 - e^{-1})$-optimality. These results reconcile the empirical success of greedy experimental design with the non-supermodularity of the A- and E-optimality criteria.

## 1 Introduction

Experimental design consists of selecting which experiments to run or measurements to observe in order to estimate some variable of interest. Finding good designs is a ubiquitous problem with applications in regression, semi-supervised learning, multivariate analysis, and sensor placement [1-10]. Nevertheless, selecting a set of $k$ experiments that optimizes a generic figure of merit is NP-hard [11, 12]. In some situations, however, an approximate solution with optimality guarantees can be obtained in polynomial time. For example, this is possible when the cost function possesses a diminishing returns property known as supermodularity, in which case greedy search is near-optimal. Greedy solutions are particularly attractive for large-scale problems due to their iterative nature and because they have lower computational complexity than typical convex relaxations [11, 12].

Supermodularity, however, is a stringent condition not met by important performance metrics. For instance, it is well known that neither the mean-square error (MSE) nor the maximum eigenvalue of the estimation error covariance matrix is supermodular [1, 13, 14]. Nevertheless, greedy algorithms have been successfully used to minimize these functions despite the lack of theoretical guarantees. The goal of this paper is to reconcile these observations by showing that these figures of merit, used in A- and E-optimal experimental designs, are approximately supermodular. To do so, it introduces different measures of approximate supermodularity and derives near-optimality results for these classes of functions. It then bounds how much the MSE and the maximum eigenvalue of the error covariance matrix violate supermodularity, leading to performance guarantees for greedy A- and E-optimal designs. More to the point, the main results of this work are:

1. The greedy solution of the A-optimal design problem is within a multiplicative $(1 - e^{-\bar{\alpha}})$ factor of the optimal, with $\bar{\alpha} \geq [1 + O(\gamma)]^{-1}$, where $\gamma$ upper bounds the signal-to-noise ratio (SNR) of the experiments (Theorem 3).

2. The value of the greedy solution of an E-optimal design problem is at most $(1 - e^{-1})\big(f(\mathcal{D}^\star) + k\bar{\epsilon}\big)$, where $\bar{\epsilon} \leq O(\gamma)$ (Theorem 4).

3. As the SNR of the experiments decreases, the performance guarantees for greedy A- and E-optimal designs approach the classical $1 - 1/e$.
This last observation is particularly interesting since careful selection of experiments is more important in low SNR scenarios. In fact, unless experiments are highly correlated, designs have similar performances in high SNR. Also, note that the guarantees in this paper are not asymptotic and hold in the worst case, i.e., they hold for problems of any dimension and for designs of any size.

**Notation** Lowercase boldface letters represent vectors ($x$), uppercase boldface letters are matrices ($X$), and calligraphic letters denote sets/multisets ($\mathcal{A}$). We write $\#\mathcal{A}$ for the cardinality of $\mathcal{A}$ and $\mathcal{P}(\mathcal{B})$ to denote the collection of all finite multisets of the set $\mathcal{B}$. To say $X$ is a positive semidefinite (PSD) matrix we write $X \succeq 0$, so that for $X, Y \in \mathbb{R}^{n \times n}$, $X \preceq Y \Leftrightarrow b^T X b \leq b^T Y b$ for all $b \in \mathbb{R}^n$. Similarly, we write $X \succ 0$ when $X$ is positive definite.

## 2 Optimal experimental design

Let $\mathcal{E}$ be a pool of possible experiments. The outcome of experiment $e \in \mathcal{E}$ is a multivariate measurement $y_e \in \mathbb{R}^{n_e}$ defined as

$$y_e = A_e \theta + v_e, \tag{1}$$

where $\theta \in \mathbb{R}^p$ is a parameter vector with a prior distribution such that $\mathbb{E}[\theta] = \bar{\theta}$ and $\mathbb{E}\big[(\theta - \bar{\theta})(\theta - \bar{\theta})^T\big] = R_\theta \succ 0$; $A_e$ is an $n_e \times p$ observation matrix; and $v_e \in \mathbb{R}^{n_e}$ is a zero-mean random variable with arbitrary covariance matrix $R_e = \mathbb{E}\big[v_e v_e^T\big] \succ 0$ that represents the experiment uncertainty. The $\{v_e\}$ are assumed to be uncorrelated across experiments, i.e., $\mathbb{E}\big[v_e v_f^T\big] = 0$ for all $e \neq f$, and independent of $\theta$. These experiments aim to estimate

$$z = H\theta, \tag{2}$$

where $H$ is an $m \times p$ matrix. Appropriately choosing $H$ is important given that the best experiments to estimate $\theta$ are not necessarily the best experiments to estimate $z$. For instance, if $\theta$ is to be used for classification, then $H$ can be chosen so as to optimize the design with respect to the output of the classifier. Alternatively, transductive experimental design can be performed by taking $H$ to be a collection of data points from a test set [6]. Finally, $H = I$, the identity matrix, recovers the classical $\theta$-estimation case.

The experiments to be used in the estimation of $z$ are collected in a multiset $\mathcal{D}$ called a design. Note that $\mathcal{D}$ contains elements of $\mathcal{E}$ with repetitions. Given a design $\mathcal{D}$, the optimal Bayesian estimate $\hat{z}_\mathcal{D}$ is readily computed. The estimation error of $\hat{z}_\mathcal{D}$ is measured by the error covariance matrix $K(\mathcal{D})$. An expression for the estimator and its error matrix in terms of the problem constants is given in the following proposition.

**Proposition 1** (Bayesian estimator). Let the experiments be defined as in (1). For $M_e = A_e^T R_e^{-1} A_e$ and a design $\mathcal{D} \in \mathcal{P}(\mathcal{E})$, the unbiased affine estimator of $z$ with the smallest error covariance matrix in the PSD cone is given by

$$\hat{z}_\mathcal{D} = H \left( \sum_{e \in \mathcal{D}} M_e + R_\theta^{-1} \right)^{-1} \left( \sum_{e \in \mathcal{D}} A_e^T R_e^{-1} y_e + R_\theta^{-1} \bar{\theta} \right). \tag{3}$$

The corresponding error covariance matrix $K(\mathcal{D}) = \mathbb{E}\big[(z - \hat{z}_\mathcal{D})(z - \hat{z}_\mathcal{D})^T\big]$ is given by

$$K(\mathcal{D}) = H \left( \sum_{e \in \mathcal{D}} M_e + R_\theta^{-1} \right)^{-1} H^T. \tag{4}$$

*Proof.* See extended version [15].

The experimental design problem consists of selecting a design $\mathcal{D}$ of cardinality at most $k$ that minimizes the overall estimation error. This can be explicitly stated as the problem of choosing $\mathcal{D}$ with $\#\mathcal{D} \leq k$ that minimizes the error covariance $K(\mathcal{D})$, whose expression is given in (4). Note that (4) can account for unregularized (non-Bayesian) experimental design by removing $R_\theta$ and using a pseudo-inverse [16]. However, the error covariance matrix is no longer monotone in this case (cf. Lemma 1). Providing guarantees for this scenario is the subject of future work.
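To make Proposition 1 concrete, the following sketch evaluates the error covariance $K(\mathcal{D})$ in (4) for a candidate design. It is a minimal NumPy illustration of the formula, not the authors' code; the function name, the toy problem instance, and the choices $H = I$, $R_\theta = I$ are our own.

```python
import numpy as np

def error_covariance(design, A, R, R_theta, H):
    """Error covariance (4): K(D) = H (sum_{e in D} M_e + R_theta^{-1})^{-1} H^T.

    design -- list of experiment indices; repetitions allowed, since D is a multiset
    A[e]   -- n_e x p observation matrix of experiment e
    R[e]   -- n_e x n_e noise covariance of experiment e
    """
    info = np.linalg.inv(R_theta)  # prior information matrix R_theta^{-1}
    for e in design:
        info += A[e].T @ np.linalg.solve(R[e], A[e])  # add M_e = A_e^T R_e^{-1} A_e
    return H @ np.linalg.inv(info) @ H.T

# Toy instance: 5 univariate experiments, p = 3 parameters.
rng = np.random.default_rng(0)
p = 3
A = [rng.normal(size=(1, p)) for _ in range(5)]
R = [np.eye(1) for _ in range(5)]
K_empty = error_covariance([], A, R, np.eye(p), np.eye(p))          # equals H R_theta H^T
K_design = error_covariance([0, 0, 2], A, R, np.eye(p), np.eye(p))  # multiset design
print(np.trace(K_empty) - np.trace(K_design))  # MSE reduction; nonnegative by Lemma 1
```

Note that repeating an experiment (here, experiment 0) adds its information matrix $M_e$ twice, which is exactly the multiset semantics used throughout the paper.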
The minimization of the PSD matrix $K(\mathcal{D})$ in experimental design is typically attempted using scalarization procedures generically known as alphabetical design criteria, the most common of which are A-, D-, and E-optimal design [17]. These are tantamount to selecting different figures of merit to compare the matrices $K(\mathcal{D})$. Our focus in this paper is mostly on A- and E-optimal designs, but we also consider D-optimal designs for comparison.

A design $\mathcal{D}$ with $k$ experiments is said to be A-optimal if it minimizes the estimation MSE, which is given by the trace of the covariance matrix:

$$\underset{\#\mathcal{D} \leq k}{\text{minimize}} \quad \operatorname{Tr}\big[K(\mathcal{D})\big] - \operatorname{Tr}\big[H R_\theta H^T\big]. \tag{P-A}$$

Notice that it is customary to say a design is A-optimal when $H = I$ in (P-A), whereas the name V-optimal is reserved for the case when $H$ is arbitrary [17]. We do not make this distinction here for conciseness.

A design is E-optimal if, instead of minimizing the MSE as in (P-A), it minimizes the largest eigenvalue of the covariance matrix $K(\mathcal{D})$, i.e.,

$$\underset{\#\mathcal{D} \leq k}{\text{minimize}} \quad \lambda_{\max}\big[K(\mathcal{D})\big] - \lambda_{\max}\big[H R_\theta H^T\big]. \tag{P-E}$$

Since the trace of a matrix is the sum of its eigenvalues, we can think of (P-E) as a robust version of (P-A). While the design in (P-A) seeks to reduce the estimation error in all directions, the design in (P-E) seeks to reduce the estimation error in the worst direction. Equivalently, given that $\lambda_{\max}(X) = \max_{\|u\|_2 = 1} u^T X u$, we can interpret (P-E) with $H = I$ as minimizing the MSE for an adversarial choice of $z$.

A D-optimal design is one in which the objective is to minimize the log-determinant of the estimator's covariance matrix:

$$\underset{\#\mathcal{D} \leq k}{\text{minimize}} \quad \log\det\big[K(\mathcal{D})\big] - \log\det\big[H R_\theta H^T\big]. \tag{P-D}$$

The motivation for using the objective in (P-D) is that the log-determinant of $K(\mathcal{D})$ is proportional to the volume of the confidence ellipsoid when the data are Gaussian. Note that the trace, maximum eigenvalue, and determinant of $H R_\theta H^T$ in (P-A), (P-E), and (P-D) are constants and do not affect the respective optimization problems. They are subtracted so that the objectives vanish when $\mathcal{D} = \emptyset$, the empty set. This simplifies the exposition in Section 4.

Although the problem formulations in (P-A), (P-E), and (P-D) are integer programs known to be NP-hard, the use of greedy methods for their solution is widespread with good performance in practice. In the case of D-optimal design, this is justified theoretically because the objective of (P-D) is supermodular, which implies greedy methods are $(1 - e^{-1})$-optimal [2, 11, 12]. The objectives in (P-A) and (P-E), on the other hand, are not supermodular in general [1, 13, 14], and it is not known why their greedy optimization yields good results in practice; conditions for the MSE to be supermodular exist, but they are restrictive [1]. The goal of this paper is to derive performance guarantees for greedy solutions of A- and E-optimal design problems. We do so by developing different notions of approximate supermodularity to show that A- and E-optimal design problems are not far from supermodular.

**Remark 1.** Besides its intrinsic value as a minimizer of the volume of the confidence ellipsoid, (P-D) is often used as a surrogate for (P-A) when A-optimality (MSE) is considered the appropriate metric. It is important to point out that this is only justified when the problem has some inherent structure that suggests the minimum volume ellipsoid is somewhat symmetric.
Otherwise, since the volume of an ellipsoid can be reduced by decreasing the length of a single principal axis, using (P-D) can lead to designs that perform well in the MSE sense along a few directions of the parameter space and poorly along all others. Formally, this can be seen by comparing the variation of the log-determinant and trace functions with respect to the eigenvalues of the PSD matrix $K$:

$$\frac{\partial \log\det(K)}{\partial \lambda_j(K)} = \frac{1}{\lambda_j(K)} \quad\text{and}\quad \frac{\partial \operatorname{Tr}(K)}{\partial \lambda_j(K)} = 1.$$

The gradient of the log-determinant is largest in the direction of the smallest eigenvalue of the error covariance matrix. In contrast, the MSE gives equal weight to all directions of the space. The latter therefore yields balanced ellipsoids, whereas the former tends to flatten the confidence ellipsoid unless the problem has a specific structure.

## 3 Approximate supermodularity

Consider a multiset function $f: \mathcal{P}(\mathcal{E}) \to \mathbb{R}$ for which the value corresponding to an arbitrary multiset $\mathcal{D} \in \mathcal{P}(\mathcal{E})$ is denoted by $f(\mathcal{D})$. We say the function $f$ is normalized if $f(\emptyset) = 0$, and we say $f$ is monotone decreasing if for all multisets $\mathcal{A} \subseteq \mathcal{B}$ it holds that $f(\mathcal{A}) \geq f(\mathcal{B})$. Observe that if a function is normalized and monotone decreasing, it must be that $f(\mathcal{D}) \leq 0$ for all $\mathcal{D}$. The objectives of (P-A), (P-E), and (P-D) are normalized and monotone decreasing multiset functions, since adding experiments to a design decreases the covariance matrix uniformly in the PSD cone (see Lemma 1). We say that a multiset function $f$ is supermodular if for all pairs of multisets $\mathcal{A}, \mathcal{B} \in \mathcal{P}(\mathcal{E})$, $\mathcal{A} \subseteq \mathcal{B}$, and elements $u \in \mathcal{E}$ it holds that

$$f(\mathcal{A}) - f(\mathcal{A} \cup \{u\}) \geq f(\mathcal{B}) - f(\mathcal{B} \cup \{u\}).$$

Supermodular functions encode a notion of diminishing returns as the sets grow. Their relevance in this paper is due to the celebrated bound on the suboptimality of their greedy minimization [18]. Specifically, construct a greedy solution by starting with $\mathcal{G}_0 = \emptyset$ and incorporating elements (experiments) $e \in \mathcal{E}$ greedily, so that at the $h$-th iteration we incorporate the element whose addition to $\mathcal{G}_{h-1}$ results in the largest reduction in the value of $f$:

$$\mathcal{G}_h = \mathcal{G}_{h-1} \cup \{e\}, \quad \text{with} \quad e = \operatorname*{argmin}_{u \in \mathcal{E}} f(\mathcal{G}_{h-1} \cup \{u\}). \tag{5}$$

The recursion in (5) is repeated for $k$ steps to obtain a greedy solution with $k$ elements (a concrete implementation is sketched at the end of this section). Then, if $f$ is normalized, monotone decreasing, and supermodular,

$$f(\mathcal{G}_k) \leq (1 - e^{-1}) f(\mathcal{D}^\star), \tag{6}$$

where $\mathcal{D}^\star \in \operatorname{argmin}_{\#\mathcal{D} \leq k} f(\mathcal{D})$ is the optimal design selection of cardinality not larger than $k$ [18]. We emphasize that in contrast to the classical greedy algorithm, (5) allows the same element to be selected multiple times.

The optimality guarantee in (6) applies to (P-D) because its objective is supermodular. This is not true of the cost functions of (P-A) and (P-E). We address this issue by postulating that if a function does not violate supermodularity too much, then its greedy minimization should have close to supermodular performance. To formalize this idea, we introduce two measures of approximate supermodularity and derive near-optimal bounds based on these properties. It is worth noting that, as intuitive as it may be, such results are not straightforward. In fact, [19] showed that even functions $\delta$-close to supermodular cannot be optimized in polynomial time. We start with the following multiplicative relaxation of the supermodular property.

**Definition 1** ($\alpha$-supermodularity). A multiset function $f: \mathcal{P}(\mathcal{E}) \to \mathbb{R}$ is $\alpha$-supermodular, for $\alpha: \mathbb{N} \times \mathbb{N} \to \mathbb{R}$, if for all multisets $\mathcal{A}, \mathcal{B} \in \mathcal{P}(\mathcal{E})$, $\mathcal{A} \subseteq \mathcal{B}$, and all $u \in \mathcal{E}$ it holds that

$$f(\mathcal{A}) - f(\mathcal{A} \cup \{u\}) \geq \alpha(\#\mathcal{A}, \#\mathcal{B}) \big[ f(\mathcal{B}) - f(\mathcal{B} \cup \{u\}) \big]. \tag{7}$$

Notice that for $\alpha \geq 1$, (7) reduces to the original definition of supermodularity, in which case we refer to the function simply as supermodular [11, 12].
On the other hand, when $\alpha < 1$, $f$ is said to be approximately supermodular. Notice that if $f$ is decreasing, then (7) always holds for $\alpha = 0$. We are therefore interested in the largest $\alpha$ for which (7) holds, i.e.,

$$\alpha(a, b) = \min_{\substack{\mathcal{A} \subseteq \mathcal{B} \in \mathcal{P}(\mathcal{E}),\ u \in \mathcal{E} \\ \#\mathcal{A} = a,\ \#\mathcal{B} = b}} \frac{f(\mathcal{A}) - f(\mathcal{A} \cup \{u\})}{f(\mathcal{B}) - f(\mathcal{B} \cup \{u\})}. \tag{8}$$

Interestingly, $\alpha$ not only measures how much $f$ violates supermodularity, but it also quantifies the loss in performance guarantee incurred from these violations.

**Theorem 1.** Let $f$ be a normalized, monotone decreasing, and $\alpha$-supermodular multiset function. Then, for $\bar{\alpha} = \min_{a < \ell,\, b < \ell + k} \alpha(a, b)$, the greedy solution from (5) obeys

$$f(\mathcal{G}_\ell) \leq \left[ 1 - \prod_{h=0}^{\ell-1} \left( 1 - \Big( \sum_{s=0}^{k-1} \alpha(h, h+s)^{-1} \Big)^{-1} \right) \right] f(\mathcal{D}^\star) \leq \left( 1 - e^{-\bar{\alpha} \ell / k} \right) f(\mathcal{D}^\star). \tag{9}$$

*Proof.* See extended version [15].

Theorem 1 bounds the suboptimality of the greedy solution from (5) when its objective is $\alpha$-supermodular. At the same time, it quantifies the effect of relaxing the supermodularity hypothesis typically used to provide performance guarantees in these settings. In fact, if $f$ is supermodular ($\alpha \equiv 1$) and for $\ell = k$, we recover the $1 - e^{-1} \approx 0.63$ guarantee from [18]. On the other hand, for an approximately supermodular function ($\bar{\alpha} < 1$), the result in (9) shows that the 63% guarantee can be recovered by selecting a set of size $\ell = \bar{\alpha}^{-1} k$. Thus, $\bar{\alpha}$ not only measures how much $f$ violates supermodularity, but also gives a factor by which the cardinality constraint must be violated to obtain a supermodular near-optimal certificate. As with the original bound in [18], it is worth noting that (9) is not tight and that better results are typical in practice (see Section 5).

Although $\alpha$-supermodularity gives a multiplicative approximation factor, finding meaningful bounds on $\alpha$ can be challenging for certain multiset functions, such as the E-optimality criterion in (P-E). It is therefore useful to look at approximate supermodularity from a different perspective, as in the following definition.

**Definition 2** ($\epsilon$-supermodularity). A multiset function $f: \mathcal{P}(\mathcal{E}) \to \mathbb{R}$ is $\epsilon$-supermodular, for $\epsilon: \mathbb{N} \times \mathbb{N} \to \mathbb{R}$, if for all multisets $\mathcal{A}, \mathcal{B} \in \mathcal{P}(\mathcal{E})$, $\mathcal{A} \subseteq \mathcal{B}$, and all $u \in \mathcal{E}$ it holds that

$$f(\mathcal{A}) - f(\mathcal{A} \cup \{u\}) \geq f(\mathcal{B}) - f(\mathcal{B} \cup \{u\}) - \epsilon(\#\mathcal{A}, \#\mathcal{B}). \tag{10}$$

Again, we say $f$ is supermodular if $\epsilon(a, b) \leq 0$ for all $a, b$ and approximately supermodular otherwise. As with $\alpha$, we want the best $\epsilon$ that satisfies (10), which is given by

$$\epsilon(a, b) = \max_{\substack{\mathcal{A} \subseteq \mathcal{B} \in \mathcal{P}(\mathcal{E}),\ u \in \mathcal{E} \\ \#\mathcal{A} = a,\ \#\mathcal{B} = b}} f(\mathcal{B}) - f(\mathcal{B} \cup \{u\}) - f(\mathcal{A}) + f(\mathcal{A} \cup \{u\}). \tag{11}$$

In contrast to $\alpha$-supermodularity, we obtain an additive approximation guarantee for the greedy minimization of $\epsilon$-supermodular functions.

**Theorem 2.** Let $f$ be a normalized, monotone decreasing, and $\epsilon$-supermodular multiset function. Then, for $\bar{\epsilon} = \max_{a < \ell,\, b < \ell + k} \epsilon(a, b)$, the greedy solution from (5) obeys

$$f(\mathcal{G}_\ell) \leq \left[ 1 - \left(1 - \frac{1}{k}\right)^{\ell} \right] f(\mathcal{D}^\star) + \frac{1}{k} \sum_{h=0}^{\ell-1} \left(1 - \frac{1}{k}\right)^{\ell-1-h} \sum_{s=0}^{k-1} \epsilon(h, h+s) \leq \left( 1 - e^{-\ell/k} \right) \big( f(\mathcal{D}^\star) + k \bar{\epsilon} \big). \tag{12}$$

*Proof.* See extended version [15].

As before, $\bar{\epsilon}$ quantifies the loss in performance guarantee due to relaxing supermodularity. Indeed, (12) reveals that $\epsilon$-supermodular functions have the same guarantees as a supermodular function up to an additive factor of $\Theta(k\bar{\epsilon})$. In fact, if $\bar{\epsilon} \leq (ek)^{-1} |f(\mathcal{D}^\star)|$ (recall that $f(\mathcal{D}^\star) \leq 0$ due to normalization), then taking $\ell = 3k$ recovers the 63% approximation factor of supermodular functions. This same factor is obtained for $\alpha$-supermodular functions with $\bar{\alpha} \geq 1/3$.

With the certificates of Theorems 1 and 2 in hand, we now proceed with the study of the A- and E-optimality criteria. In the next section, we derive explicit bounds on their $\alpha$- and $\epsilon$-supermodularity, respectively, thus providing near-optimal performance guarantees for greedy A- and E-optimal designs.
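To illustrate the recursion (5) referenced above, here is a short sketch of greedy design selection with the (unnormalized) A- and E-optimality costs; dropping the constants $\operatorname{Tr}[H R_\theta H^T]$ and $\lambda_{\max}[H R_\theta H^T]$ does not change the argmin. It reuses the hypothetical `error_covariance` helper from Section 2 and is our own illustration rather than the authors' implementation.

```python
import numpy as np

def greedy_design(cost, experiments, k):
    """Greedy recursion (5): at each step, add the experiment whose inclusion
    yields the smallest cost. The same experiment may be picked multiple times,
    since designs are multisets."""
    design = []
    for _ in range(k):
        design.append(min(experiments, key=lambda u: cost(design + [u])))
    return design

# Unnormalized A- and E-optimality objectives built on error_covariance (Section 2).
def a_cost(design, A, R, R_theta, H):
    return np.trace(error_covariance(design, A, R, R_theta, H))  # MSE, as in (P-A)

def e_cost(design, A, R, R_theta, H):
    # eigvalsh returns eigenvalues in ascending order; take the largest, as in (P-E)
    return np.linalg.eigvalsh(error_covariance(design, A, R, R_theta, H))[-1]

# Usage on the toy instance from Section 2:
# D_A = greedy_design(lambda d: a_cost(d, A, R, np.eye(p), np.eye(p)), range(len(A)), k=4)
```

In practice one would update the information matrix incrementally instead of re-inverting it for every candidate; the sketch above trades that efficiency for clarity.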
## 4 Near-optimal experimental design

Theorems 1 and 2 apply to functions that are (i) normalized, (ii) monotone decreasing, and (iii) approximately supermodular. By construction, the objectives of (P-A) and (P-E) are normalized [(i)]. The following lemma establishes that they are also monotone decreasing [(ii)] by showing that $K$ is a decreasing set function in the PSD cone; the definition of the Loewner order and the monotonicity of the trace operator then readily give the desired results [16].

**Lemma 1.** The matrix-valued set function $K(\mathcal{D})$ in (4) is monotonically decreasing with respect to the PSD cone, i.e., $\mathcal{A} \subseteq \mathcal{B} \Rightarrow K(\mathcal{A}) \succeq K(\mathcal{B})$.

*Proof.* See extended version [15].

The main results of this section provide the final ingredient [(iii)] for Theorems 1 and 2 by bounding the approximate supermodularity of the A- and E-optimality criteria. We start by showing that the objective of (P-A) is $\alpha$-supermodular.

**Theorem 3.** The objective of (P-A) is $\alpha$-supermodular with

$$\alpha(a, b) \geq \frac{1}{\kappa(H)^2} \cdot \frac{\lambda_{\min}\big(R_\theta^{-1}\big)}{\lambda_{\max}\big(R_\theta^{-1}\big) + a\, \ell_{\max}}, \quad \text{for all } b \in \mathbb{N}, \tag{13}$$

where $\ell_{\max} = \max_{e \in \mathcal{E}} \lambda_{\max}(M_e)$, $M_e = A_e^T R_e^{-1} A_e$, and $\kappa(H) = \sigma_{\max} / \sigma_{\min}$ is the $\ell_2$-norm condition number of $H$, with $\sigma_{\max}$ and $\sigma_{\min}$ denoting the largest and smallest singular values of $H$, respectively.

*Proof.* See extended version [15].

Theorem 3 bounds the $\alpha$-supermodularity of the objective of (P-A) in terms of the condition number of $H$, the prior covariance matrix, and the measurements' SNR. To facilitate the interpretation of this result, let the SNR of the $e$-th experiment be $\gamma_e = \operatorname{Tr}[M_e]$ and suppose $R_\theta = \sigma_\theta^2 I$, $H = I$, and $\gamma_e \leq \gamma$ for all $e \in \mathcal{E}$. Then, for $\ell = k$ greedy iterations, (13) implies

$$\bar{\alpha} \geq \frac{1}{1 + 2k \sigma_\theta^2 \gamma},$$

for $\bar{\alpha}$ as in Theorem 1. This deceptively simple bound reveals that the MSE behaves as a supermodular function at low SNRs. Formally, $\bar{\alpha} \to 1$ as $\gamma \to 0$. In contrast, the performance guarantee from Theorem 3 degrades in high SNR scenarios. In this case, however, greedy methods are expected to give good results since designs yield similar estimation errors (as illustrated in Section 5). The greedy solution of (P-A) also approaches the $1 - 1/e$ guarantee when the prior on $\theta$ is concentrated ($\sigma_\theta^2 \ll 1$), i.e., when the problem is heavily regularized. These observations also hold for a generic $H$ as long as it is well-conditioned. Even if $\kappa(H) \gg 1$, we can replace $H$ by $\tilde{H} = DH$ for some diagonal matrix $D \succ 0$ without affecting the design, since $z$ is arbitrarily scaled. The scaling $D$ can be designed to minimize the condition number of $\tilde{H}$ by leveraging preconditioning and balancing methods [20, 21].

Proceeding, we derive guarantees for E-optimal designs using $\epsilon$-supermodularity.

**Theorem 4.** The cost function of (P-E) is $\epsilon$-supermodular with

$$\epsilon(a, b) \leq (b - a)\, \sigma_{\max}(H)^2\, \lambda_{\max}(R_\theta)^2\, \ell_{\max}, \tag{14}$$

where $\ell_{\max} = \max_{e \in \mathcal{E}} \lambda_{\max}(M_e)$, $M_e = A_e^T R_e^{-1} A_e$, and $\sigma_{\max}(H)$ is the largest singular value of $H$.

*Proof.* See extended version [15].

Under the same assumptions as above, Theorem 4 gives

$$\bar{\epsilon} \leq 2k \sigma_\theta^4 \gamma,$$

for $\bar{\epsilon}$ as in Theorem 2. Thus, $\bar{\epsilon} \to 0$ as $\gamma \to 0$.

[Figure 1: A-optimal design: (a) Thm. 3; (b) A-optimality (low SNR); (c) A-optimality (high SNR). The plots show the unnormalized A-optimality value for clarity.]

[Figure 2: E-optimal design: (a) Thm. 4; (b) E-optimality (low SNR); (c) E-optimality (high SNR). The plots show the unnormalized E-optimality value for clarity.]
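As a sanity check of how Theorems 3 and 4 translate into certificates, the sketch below evaluates the simplified bounds $\bar{\alpha} \geq (1 + 2k\sigma_\theta^2\gamma)^{-1}$ and $\bar{\epsilon} \leq 2k\sigma_\theta^4\gamma$ for a pool of experiments, under the same assumptions used above ($R_\theta = \sigma_\theta^2 I$, $H = I$, $\ell = k$). The helper name is ours, not the authors'.

```python
import numpy as np

def snr_certificates(A, R, sigma_theta2, k):
    """Simplified bounds from Theorems 3 and 4 (R_theta = sigma_theta^2 I, H = I, l = k):
    alpha_bar >= 1 / (1 + 2 k sigma_theta^2 gamma) and eps_bar <= 2 k sigma_theta^4 gamma,
    where gamma = max_e Tr[M_e] upper bounds the SNR of the experiments."""
    gamma = max(np.trace(Ae.T @ np.linalg.solve(Re, Ae)) for Ae, Re in zip(A, R))
    alpha_bar = 1.0 / (1.0 + 2 * k * sigma_theta2 * gamma)
    eps_bar = 2 * k * sigma_theta2**2 * gamma
    return alpha_bar, eps_bar

# As gamma -> 0 (low SNR), alpha_bar -> 1 and eps_bar -> 0, so the certificates of
# Theorems 1 and 2 approach the supermodular (1 - 1/e) guarantee.
```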
In other words, the behavior of the objective of (P-E) approaches that of a supermodular function as the SNR decreases. The same holds for concentrated priors, i.e., $\lim_{\sigma_\theta^2 \to 0} \bar{\epsilon} = 0$. Once again, it is worth noting that when the SNRs of the experiments are large, almost every design has the same E-optimal performance as long as the experiments are not too correlated. Thus, greedy design is also expected to give good results under these conditions. Finally, the proofs of Theorems 3 and 4 suggest that better bounds can be found when the designs are constructed without replacement, i.e., when only one of each experiment is allowed in the design.

## 5 Numerical examples

In this section, we illustrate the previous results in some numerical examples. To do so, we draw the elements of $A_e$ i.i.d. from a zero-mean Gaussian distribution with variance $1/p$ and $p = 20$. The noise $\{v_e\}$ are also Gaussian random variables with $R_e = \sigma_v^2 I$. We take $\sigma_v^2 = 10^{-1}$ in high SNR and $\sigma_v^2 = 10$ in low SNR simulations. The experiment pool contains $\#\mathcal{E} = 200$ experiments.

Starting with A-optimal design, we display the bound from Theorem 3 in Figure 1a for multivariate measurements of size $n_e = 5$ and designs of size $k = 40$. Here, the equivalent $\alpha$ is the single $\hat{\alpha}$ that gives the same near-optimal certificate (9) as using (13). As expected, $\hat{\alpha}$ approaches 1 as the SNR decreases. In fact, at $-10$ dB it is already close to 0.75, which means that by selecting a design of size $\ell = 55$ we would be within $1 - 1/e$ of the optimal design of size $k = 40$. Figures 1b and 1c compare greedy A-optimal designs with the convex relaxation of (P-A) in low and high SNR scenarios. The designs are obtained from the continuous solutions using the "hard constraint, with replacement" method of [10] and a simple design truncation as in [22]. Therefore, these simulations consider univariate measurements ($n_e = 1$). For comparison, a design sampled uniformly at random with replacement from $\mathcal{E}$ is also presented. Note that, as mentioned before, the performance difference across designs is small for high SNR (notice the scale in Figures 1c and 2c), so that even random designs perform well.

For the E-optimality criterion, the bound from Theorem 4 is shown in Figure 2a, again for multivariate measurements of size $n_e = 5$ and designs of size $k = 40$. Once again, the equivalent $\epsilon$ is the single value $\hat{\epsilon}$ that yields the same guarantee as using (14). In this case, the bound degradation in high SNR is more pronounced. This reflects the difficulty in bounding the approximate supermodularity of the E-optimality cost function. Still, Figures 2b and 2c show that greedy E-optimal designs have good performance when compared to convex relaxations or random designs. Note that, though it is not intended for E-optimal designs, we again display the results of the sampling post-processing from [10]. In Figure 2b, the random design is omitted due to its poor performance.

### 5.1 Cold-start survey design for recommender systems

Recommender systems use semi-supervised learning methods to predict user ratings based on few rated examples. These methods are useful, for instance, to streaming service providers who are interested in using predicted ratings of movies to provide recommendations. For new users, these systems suffer from a cold-start problem, which refers to the fact that it is hard to provide accurate recommendations without knowing a user's preference on at least a few items. For this reason, services explicitly ask users for ratings in initial surveys before emitting any recommendation.
Selecting which movies should be rated to better predict a user's preferences can be seen as an experimental design problem. In the following example, we use a subset of the EachMovie dataset [23] to illustrate how greedy experimental design can be applied to address this problem. We randomly selected a training and a test set containing 9000 and 3000 users, respectively. Following the notation from Section 2, each experiment in $\mathcal{E}$ represents a movie ($\#\mathcal{E} = 1622$) and the observation vector $A_e$ collects the ratings of movie $e$ for each user in the training set. The parameter $\theta$ is used to express the ratings of a new user in terms of those in the training set. Our hope is that we can extrapolate the observed ratings, i.e., $\{y_e\}_{e \in \mathcal{D}}$, to obtain the rating for a movie $f \notin \mathcal{D}$ as $\hat{y}_f = A_f \hat{\theta}$. Since the mean absolute error (MAE) is commonly used in this setting, we choose to work with the A-optimality criterion. We also let $H = I$ and take a non-informative prior $\bar{\theta} = 0$ and $R_\theta = \sigma_\theta^2 I$ with $\sigma_\theta^2 = 100$.

As expected, greedy A-optimal design is able to find small sets of movies that lead to good predictions. For $k = 10$, for example, MAE = 2.3, steadily decreasing until MAE < 1.8 for $k \geq 35$. These are considerably better results than a random movie selection, for which the MAE varies between 2.8 and 3.3 for $k$ between 10 and 50. Instead of focusing on the raw ratings, we may be interested in predicting the user's favorite genre. This is a challenging task due to the heavily skewed dataset. For instance, 32% of the movies are dramas whereas only 0.02% are animations. Still, we use the simplest possible classifier by selecting the category with the highest average estimated ratings. By using greedy design, we can obtain a misclassification rate of approximately 25% by observing 100 ratings, compared to over 45% error rate for a random design.

## 6 Related work

**Optimal experimental design** Classical experimental design typically relies on convex relaxations to solve optimal design problems [17, 22]. However, because these are semidefinite programs (SDPs) or sequential second-order cone programs (SOCPs), their computational complexity can hinder their use in large-scale problems [5, 7, 22, 24]. Another issue with these relaxations is that some sort of post-processing is required to extract a valid design from their continuous solutions [5, 22]. For D-optimal designs, this can be done with $(1 - e^{-1})$-optimality [25, 26]. For A-optimal designs, [10] provides near-optimal randomized schemes for large enough $k$.

**Greedy optimization guarantees** The $(1 - e^{-1})$-suboptimality of greedy search for supermodular minimization under cardinality constraints was established in [18]. To deal with the fact that the MSE is not supermodular, $\alpha$-supermodularity with constant $\alpha$ was introduced in [27] along with explicit lower bounds. This concept is related to the submodularity ratio introduced by [3] to obtain guarantees similar to Theorem 1 for dictionary selection and forward regression. However, the bounds on the submodularity ratio from [3, 28] depend on the sparse eigenvalues of $K$ or restricted strong convexity constants of the A-optimal objective, which are NP-hard to compute. Explicit bounds for the submodularity ratio of A-optimal experimental design were recently obtained in [29]. Nevertheless, neither [27] nor [29] considers multisets. Hence, to apply their results we must operate on an extended ground set containing $k$ unique copies of each experiment, which makes the bounds uninformative.
For instance, in the setting of Section 5, Theorem 3 guarantees 0.1-optimality at 0 dB SNR whereas [29] guarantees $2.5 \times 10^{-6}$-optimality. The concept of $\epsilon$-supermodularity was first explored in [30] for a constant $\epsilon$. There, guarantees for dictionary selection were derived by bounding $\epsilon$ using an incoherence assumption on the $A_e$. Finally, a more stringent definition of approximately submodular functions was put forward in [19] by requiring the function to be upper and lower bounded by a submodular function. They show strong impossibility results unless the function is $O(1/k)$-close to submodular. Approximate submodularity is sometimes referred to as weak submodularity (e.g., [28]), though it is not related to the weak submodularity concept from [31].

## 7 Conclusions

Greedy search is known to be an empirically effective method to find A- and E-optimal experimental designs despite the fact that these objectives are not supermodular. We reconciled these observations by showing that the A- and E-optimality criteria are approximately supermodular and deriving near-optimal guarantees for this class of functions. By quantifying their supermodularity violations, we showed that the behavior of the MSE and the maximum eigenvalue of the error covariance matrix becomes increasingly supermodular as the SNR decreases. An important open question is whether these results can be improved using additional knowledge. Can we exploit some structure of the observation matrices (e.g., Fourier, random)? What if the parameter vector is sparse but with unknown support (e.g., compressive sensing)? Are there practical experiment properties other than the SNR that lead to small supermodularity violations? Finally, we hope that this approximate supermodularity framework can be extended to other problems.

## Acknowledgments

This work was supported by the National Science Foundation CCF 1717120 and in part by ARO W911NF1710438.

## References

[1] A. Das and D. Kempe, "Algorithms for subset selection in linear regression," in ACM Symp. on Theory of Comput., 2008, pp. 45-54.

[2] A. Krause, A. Singh, and C. Guestrin, "Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies," J. Mach. Learning Research, vol. 9, pp. 235-284, 2008.

[3] A. Das and D. Kempe, "Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection," in Int. Conf. on Mach. Learning, 2011.

[4] Y. Washizawa, "Subset kernel principal component analysis," in Int. Workshop on Mach. Learning for Signal Process., 2009.

[5] S. Joshi and S. Boyd, "Sensor selection via convex optimization," IEEE Trans. Signal Process., vol. 57, no. 2, pp. 451-462, 2009.

[6] K. Yu, J. Bi, and V. Tresp, "Active learning via transductive experimental design," in International Conference on Machine Learning, 2006, pp. 1081-1088.

[7] P. Flaherty, A. Arkin, and M.I. Jordan, "Robust design of biological experiments," in Advances in Neural Information Processing Systems, 2006, pp. 363-370.

[8] X. Zhu, "Semi-supervised learning literature survey," 2008, http://pages.cs.wisc.edu/~jerryzhu/research/ssl/semireview.html.

[9] S. Liu, S.P. Chepuri, M. Fardad, E. Masazade, G. Leus, and P.K. Varshney, "Sensor selection for estimation with correlated measurement noise," IEEE Trans. Signal Process., vol. 64, no. 13, pp. 3509-3522, 2016.

[10] Y. Wang, A.W. Yu, and A. Singh, "On computationally tractable selection of experiments in regression models," 2017, arXiv:1601.02068v5.
[11] F. Bach, "Learning with submodular functions: A convex optimization perspective," Foundations and Trends in Machine Learning, vol. 6, no. 2-3, pp. 145-373, 2013.

[12] A. Krause and D. Golovin, "Submodular function maximization," in Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.

[13] G. Sagnol, "Approximation of a maximum-submodular-coverage problem involving spectral functions, with application to experimental designs," Discrete Appl. Math., vol. 161, no. 1-2, pp. 258-276, 2013.

[14] T.H. Summers, F.L. Cortesi, and J. Lygeros, "On submodularity and controllability in complex dynamical networks," IEEE Trans. Contr. Netw. Syst., vol. 3, no. 1, pp. 91-101, 2016.

[15] L.F.O. Chamon and A. Ribeiro, "Approximate supermodularity bounds for experimental design," 2017, arXiv:1711.01501.

[16] R.A. Horn and C.R. Johnson, Matrix Analysis, Cambridge University Press, 2013.

[17] F. Pukelsheim, Optimal Design of Experiments, SIAM, 2006.

[18] G.L. Nemhauser, L.A. Wolsey, and M.L. Fisher, "An analysis of approximations for maximizing submodular set functions-I," Mathematical Programming, vol. 14, no. 1, pp. 265-294, 1978.

[19] T. Horel and Y. Singer, "Maximization of approximately submodular functions," in Advances in Neural Information Processing Systems, 2016, pp. 3045-3053.

[20] M. Benzi, "Preconditioning techniques for large linear systems: A survey," Journal of Computational Physics, vol. 182, no. 2, pp. 418-477, 2002.

[21] R.D. Braatz and M. Morari, "Minimizing the Euclidean condition number," SIAM Journal on Control and Optimization, vol. 32, no. 6, pp. 1763-1768, 1994.

[22] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.

[23] Digital Equipment Corporation, "EachMovie dataset," http://www.gatsby.ucl.ac.uk/~chuwei/data/EachMovie/.

[24] G. Sagnol, "Computing optimal designs of multiresponse experiments reduces to second-order cone programming," Journal of Statistical Planning and Inference, vol. 141, no. 5, pp. 1684-1708, 2011.

[25] T. Horel, S. Ioannidis, and M. Muthukrishnan, "Budget feasible mechanisms for experimental design," in Latin American Theoretical Informatics Symposium, 2014.

[26] A.A. Ageev and M.I. Sviridenko, "Pipage rounding: A new method of constructing algorithms with proven performance guarantee," Journal of Combinatorial Optimization, vol. 8, no. 3, pp. 307-328, 2004.

[27] L.F.O. Chamon and A. Ribeiro, "Near-optimality of greedy set selection in the sampling of graph signals," in Global Conf. on Signal and Inform. Process., 2016.

[28] E.R. Elenberg, R. Khanna, A.G. Dimakis, and S. Negahban, "Restricted strong convexity implies weak submodularity," 2016, arXiv:1612.00804.

[29] A. Bian, J.M. Buhmann, A. Krause, and S. Tschiatschek, "Guarantees for greedy maximization of non-submodular functions with applications," in ICML, 2017.

[30] A. Krause and V. Cevher, "Submodular dictionary selection for sparse representation," in Int. Conf. on Mach. Learning, 2010.

[31] A. Borodin, D.T.M. Le, and Y. Ye, "Weakly submodular functions," 2014, arXiv:1401.6697v5.