# Near-Optimal Design of Experiments via Regret Minimization

Zeyuan Allen-Zhu\*¹, Yuanzhi Li\*², Aarti Singh\*³, Yining Wang\*³

\*Author names are listed in alphabetical order. ¹Microsoft Research, Redmond, USA. ²Princeton University, Princeton, USA. ³Carnegie Mellon University, Pittsburgh, USA. Correspondence to: Yining Wang.

*Proceedings of the 34th International Conference on Machine Learning*, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

**Abstract:** We consider computationally tractable methods for the experimental design problem, where $k$ out of $n$ design points of dimension $p$ are selected so that certain optimality criteria are approximately satisfied. Our algorithm finds a $(1+\varepsilon)$-approximate optimal design when $k$ is a linear function of $p$; in contrast, existing results require $k$ to be super-linear in $p$. Our algorithm also handles all popular optimality criteria, while existing ones only handle one or two such criteria. Numerical results on synthetic and real-world design problems verify the practical effectiveness of the proposed algorithm.

## 1. Introduction

Experimental design is an important problem in statistics and machine learning research (Pukelsheim, 2006). Consider a linear regression model

$$y = X\beta_0 + w, \tag{1}$$

where $X \in \mathbb{R}^{n\times p}$ is a pool of $n$ design points, $y$ is the response vector, $\beta_0$ is a $p$-dimensional unknown regression model and $w$ is a vector of i.i.d. noise variables satisfying $\mathbb{E}[w_i] = 0$ and $\mathbb{E}[w_i^2] < \infty$. The experimental design problem is to select a small subset of rows (i.e., design points) $X_S$ from the design pool $X$ so that the statistical power of estimating $\beta_0$ from the noisy responses $y_S$ on the selected designs $X_S$ is maximized.

As an example, consider a material synthesis application where $p$ is the number of variables (e.g., temperature, pressure, duration) that are hypothesized to affect the quality of the synthesized material, and $n$ is the total number of combinations of different experimental conditions. As experiments are expensive and time-consuming, one wishes to select $k \ll n$ experimental settings from $X$ that are the most statistically efficient for establishing a model that connects the experimental parameters with the synthesized material quality $y$.

The experimental design problem is also related to many machine learning tasks, such as linear bandits (Deshpande & Montanari, 2012; Huang et al., 2016), diversity sampling (Kulesza & Taskar, 2012) and active learning (Ma et al., 2013; Chaudhuri et al., 2015; Hazan & Karnin, 2015; Balcan & Long, 2013; Wang & Singh, 2016).

Since statistical efficiency can be measured in various ways, there exist a number of optimality criteria to guide the selection of experiments. We review some optimality criteria in Sec. 2; interested readers are referred to Sec. 6 of (Pukelsheim, 2006) for a comprehensive review. Typically, an optimality criterion is a function $f : \mathbb{S}_p^+ \to \mathbb{R}$ that maps the $p$-dimensional positive definite cone to a real number. The experimental design problem can then be formulated as a combinatorial optimization problem:

$$S^*(k) = \arg\min_{S \in \mathcal{S}_b(n,k)} f(X_S^\top X_S), \tag{2}$$

where $S$ is either a set or a multi-set of size $k$, and $X_S \in \mathbb{R}^{k\times p}$ is formed by stacking the rows of $X$ that are in $S$. The constraint set $\mathcal{S}_b(n,k)$, for $b \in \{1,2\}$, is defined as follows:

1. **With replacement:** $\mathcal{S}_1(n,k) = \{S \text{ multi-set} : S \subseteq [n],\ |S| \le k\}$. Under this setting, $X_S$ may contain duplicate rows of the design pool $X$.
2. **Without replacement:** $\mathcal{S}_2(n,k) = \{S \text{ standard set} : S \subseteq [n],\ |S| \le k\}$. Under this setting, $X_S$ contains only distinct rows of the design pool $X$.
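To make the combinatorial objective in Eq. (2) concrete, the following minimal NumPy sketch builds $X_S$ from a candidate subset and forms the matrix $X_S^\top X_S$ on which an optimality criterion $f$ would be evaluated. The design pool and subsets are toy placeholders of ours, not data from the paper.

```python
import numpy as np

def gram_of_subset(X, S):
    """Stack the rows of X indexed by S (a list, possibly with repeated indices
    under the with-replacement setting b=1) and return X_S^T X_S, the matrix
    that an optimality criterion f is evaluated on in Eq. (2)."""
    X_S = X[np.asarray(S), :]
    return X_S.T @ X_S

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))               # toy pool: n = 100, p = 5
Sigma_rep = gram_of_subset(X, [3, 7, 7, 42, 60])    # b = 1: row 7 measured twice
Sigma_set = gram_of_subset(X, [3, 7, 19, 42, 60])   # b = 2: distinct rows only
```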
The with replacement setting is classical in the statistics literature, where multiple measurements in $y$ at the same design point lead to different values with statistically independent noise. The without replacement setting, on the other hand, is more relevant in machine learning applications, because labels are not likely to change if the same data point (e.g., the same image) is queried twice. Finally, it is worth pointing out that the with replacement setting is easier, because it can be reduced (in polynomial time) to the without replacement setting by replicating each row of $X$ for $k$ times.

*Table 1.* Comparison with existing results on computationally efficient experimental design. An algorithm produces a subset $\hat S$ of size $k$, and the approximation ratio is defined as $f(X_{\hat S}^\top X_{\hat S}) / \min_{S\in\mathcal{S}_b(n,r)} f(X_S^\top X_S)$ for $r \le k$. $T_{\mathrm{OPT}}$ denotes the time complexity of solving the continuous convex optimization problem in Eq. (6). In $O(\cdot)$ and $\Omega(\cdot)$ we hide logarithmic dependency on $n$, $p$ and $k$.

| Method | Without replacement? | Deterministic? | Time complexity | Constraints | Criteria | Approx. ratio |
|---|---|---|---|---|---|---|
| Pipage rounding | Yes | Yes | $T_{\mathrm{OPT}} + O(n^2p^2)$ | $k = r \ge p$ | D, T | $(1-1/e)^{-1}$ |
| (Bouhtou et al., 2010) | Yes | No | $T_{\mathrm{OPT}} + O(n)$ | $k = r \ge p$ | D | $(n/k)^{1/p}$ |
| (Avron & Boutsidis, 2013) | Yes | Yes | $O(n^2p^2)$ | $k = r \ge p$ | A | $\frac{n-p+1}{k-p+1}$ |
| (Wang et al., 2016) | No | No | $T_{\mathrm{OPT}} + O(nk)$ | $k = \Omega(r)$, $r = \Omega((p\log p)/\varepsilon^2)$ | A, V | $1+\varepsilon$ |
| (Wang et al., 2016) | Yes | Yes | $T_{\mathrm{OPT}} + O(p^6)$ | $k = r = \Omega(p^2/\varepsilon)$ | A, V | $1+\varepsilon$ |
| This paper, Eq. (3) | No | Yes | $T_{\mathrm{OPT}} + O(nkp^2)$ | $k = r = \Omega(p/\varepsilon^2)$ | A, D, T, E, V, G | $1+\varepsilon$ |
| This paper, Eq. (4) | Yes | Yes | $T_{\mathrm{OPT}} + O(nkp^2)$ | $k = r > 2p$ | A, D, T, E, V, G | $O(1)$ |
| This paper, Eq. (5) | Yes | Yes | $T_{\mathrm{OPT}} + O(nkp^2)$ | $k = \Omega(r)$, $r \ge p/\varepsilon^2$ | A, D, T, E, V, G | $1+\varepsilon$ |

For many popular choices of $f$, the exact optimization problem in Eq. (2) is NP-hard (Çivril & Magdon-Ismail, 2009; Černý & Hladík, 2012). In this paper, we propose a computationally tractable algorithm that approximately solves Eq. (2) for a wide range of optimality criteria, under very weak conditions on $n$, $k$ and $p$. Below is our main theorem:

**Theorem 1.1.** Suppose $b \in \{1,2\}$, $n > k > p$ and let $f : \mathbb{S}_p^+ \to \mathbb{R}$ be a regular optimality criterion (cf. Definition 2.1). There exists a polynomial-time algorithm that outputs $\hat S \in \mathcal{S}_b(n,k)$ for any input matrix $X \in \mathbb{R}^{n\times p}$ with full column rank, and $\hat S$ satisfies the following:

1. For $b = 1$ (with replacement), there exists an absolute constant $C_0 \le 32$ such that, for any $\varepsilon \in (0,1)$, if $k \ge C_0\, p/\varepsilon^2$ then
$$f(X_{\hat S}^\top X_{\hat S}) \le (1+\varepsilon) \min_{S \in \mathcal{S}_1(n,k)} f(X_S^\top X_S). \tag{3}$$
2. For $b = 2$ (without replacement) and any $\xi > 2$, there exists a constant $C_1(\xi) > 0$ depending only on $\xi$ such that, if $k \ge \xi p$ then
$$f(X_{\hat S}^\top X_{\hat S}) \le C_1(\xi) \min_{S \in \mathcal{S}_2(n,k)} f(X_S^\top X_S). \tag{4}$$
Moreover, for $\xi \ge 4$ we have $C_1(\xi) \le 32$.
3. For $b = 2$ (without replacement) and any $\varepsilon \in (0, 1/2)$, if $k, r$ satisfy $k \ge 4(1+7\varepsilon)r$ and $r \ge p/\varepsilon^2$, then
$$f(X_{\hat S}^\top X_{\hat S}) \le (1+\varepsilon) \min_{S \in \mathcal{S}_2(n,r)} f(X_S^\top X_S). \tag{5}$$

We interpret the significance of Theorem 1.1 as follows. Under the very mild condition $k > 2p$, our polynomial-time algorithm finds a set $\hat S \subseteq [n]$ of size $k$ whose objective value $f(X_{\hat S}^\top X_{\hat S})$ is at most a constant $O(1)$ times the optimum; see Eq. (4). If replacement ($b = 1$) or over-sampling ($k > r$) is allowed, the approximation ratio can be tightened to $1+\varepsilon$ for arbitrarily small $\varepsilon > 0$; see Eqs. (3) and (5). In all three cases, we only require $k$ to grow linearly in $p$. Recall that $k \ge p$ is necessary to ensure the non-singularity of $X_{\hat S}^\top X_{\hat S}$.
In contrast, no polynomial-time algorithm has previously achieved an $O(1)$ approximation in the regime $k = O(p)$ for non-submodular optimality criteria (e.g., A- and V-optimality) under the without replacement setting. Our algorithm works for any regular optimality criterion. To the best of our knowledge, no known polynomial-time algorithm achieves a $(1+\varepsilon)$ approximation for the D- and T-optimality criteria, or even an $O(1)$ approximation for the E- and G-optimality criteria. See Table 1 for a comparison.

The key idea behind our proof of Theorem 1.1 is a regret-minimization characterization of the least eigenvalue of positive semidefinite (PSD) matrices. Similar ideas were developed in (Allen-Zhu et al., 2015; Silva et al., 2016) to construct efficient algorithms for linear-sized graph sparsifiers. In this paper we adopt the regret minimization framework and present a novel potential function analysis for the specific application of experimental design.

### 1.1. Notations

$\mathbb{S}_p^+$ is the positive definite cone of $p\times p$ matrices: a $p\times p$ symmetric matrix $A$ belongs to $\mathbb{S}_p^+$ if and only if $v^\top A v > 0$ for all $v \in \mathbb{R}^p\setminus\{0\}$. For symmetric matrices $A$ and $B$, we write $A \succeq B$ if $v^\top(A-B)v \ge 0$ for all $v \in \mathbb{R}^p$. The inner product is defined as $\langle A, B\rangle = \mathrm{tr}(B^\top A) = \sum_{i,j=1}^p A_{ij}B_{ij}$. We use $\|A\|_2 = \sup_{v\in\mathbb{R}^p\setminus\{0\}} \|Av\|_2/\|v\|_2$ to denote the spectral norm, and $\|A\|_F = \sqrt{\sum_{i,j=1}^p A_{ij}^2} = \sqrt{\langle A, A\rangle}$ to denote the Frobenius norm of $A$. For $A \succeq 0$, we write $B = A^{1/2}$ for the unique $B \succeq 0$ that satisfies $B^2 = A$. For a design pool $X \in \mathbb{R}^{n\times p}$, we use $x_i \in \mathbb{R}^p$ to denote the $i$-th row of $X$. We use $\sigma_{\min}(A)$ for the least (smallest) singular value of a PSD matrix $A$.

### 1.2. Related work

Experimental design is an old topic in statistics research (Pukelsheim, 2006; Fedorov, 1972). Computationally efficient experimental design algorithms with provable guarantees are, however, a less studied field. In the case of submodular optimality criteria (e.g., D- and T-optimality), the classical pipage rounding method (Ageev & Sviridenko, 2004; Horel et al., 2014; Ravi et al., 2016) combined with semi-definite programming yields computationally efficient algorithms that enjoy a constant approximation ratio. Bouhtou et al. (2010) improve the approximation ratio when $k$ is very close to $n$. Deshpande & Rademacher (2010) and Li et al. (2017) considered polynomial-time algorithms for sampling from a D-optimality criterion. These algorithms are not applicable to non-submodular criteria, such as A-, V-, E- or G-optimality.

For the particular A-optimality criterion, Avron & Boutsidis (2013) proposed a greedy algorithm with an approximation ratio of $O(n/k)$ with respect to $f(X^\top X)$. It was shown that in the worst case $\min_{|S|\le k} f(X_S^\top X_S) \asymp O(n/k)\cdot f(X^\top X)$, and hence the bound is tight. However, for a general design pool, $\min_{|S|\le k} f(X_S^\top X_S)$ could be far smaller than $O(n/k)\cdot f(X^\top X)$, making the theoretical results powerless in such scenarios. Wang et al. (2016) considered a variant of the greedy method and showed an approximation ratio quadratic in the design dimension $p$ and independent of the pool size $n$. Wang et al. (2016) also derived algorithms based on effective resistance sampling (Spielman & Srivastava, 2011) that attain a $(1+\varepsilon)$ approximation ratio if $k = \Omega(p\log p/\varepsilon^2)$ and repetitions of design points are allowed. That algorithm fundamentally relies on the capability of re-weighting (repeating) design points and cannot be adapted to the more general without replacement setting.
Naive sampling based methods were considered in (Wang et al., 2016; Chaudhuri et al., 2015; Dhillon et al., 2013); they also achieve a $(1+\varepsilon)$ approximation but require the subset size $k$ to be much larger than the condition number of $X$.

A related yet different topic is low-rank matrix column subset selection and CUR approximation, which seeks a column subset $C$ and a row subset $R$ such that $\|X - CC^{\dagger}X\|_F$ and/or $\|X - CUR\|_F$ are minimized (Drineas et al., 2008; Boutsidis & Woodruff, 2014; Wang & Singh, 2015b; Drineas & Mahoney, 2005; Wang & Zhang, 2013; Wang & Singh, 2015a). These problems are unsupervised in nature and do not in general correspond to statistical properties under supervised regression settings. Pilanci & Wainwright (2016), Raskutti & Mahoney (2014) and Woodruff (2014) considered fast methods for solving ordinary least squares (OLS) problems. They are computationally oriented and typically require knowledge of the full response vector $y$, which is different from the experimental design problem.

## 2. Regular criteria and continuous relaxation

We start with the definition of regular optimality criteria:

**Definition 2.1 (Regular criteria).** An optimality criterion $f : \mathbb{S}_p^+ \to \mathbb{R}$ is regular if it satisfies the following properties:

1. *Convexity*:¹ $f(\lambda A + (1-\lambda)B) \le \lambda f(A) + (1-\lambda) f(B)$ for all $\lambda \in [0,1]$ and $A, B \in \mathbb{S}_p^+$;
2. *Monotonicity*: if $A \succeq B$ then $f(A) \le f(B)$;
3. *Reciprocal multiplicity*: $f(tA) = t^{-1} f(A)$ for all $t > 0$ and $A \in \mathbb{S}_p^+$.

¹This property could be relaxed to allow a convex proxy function $g : \mathbb{S}_p^+ \to \mathbb{R}$ such that $g(A) \le g(B)$ if and only if $f(A) \le f(B)$.

Almost all optimality criteria used in the experimental design literature are regular. Below we list a few popular examples; their statistical implications can be found in (Pukelsheim, 2006), and a minimal reference implementation is sketched at the end of this section:

- A-optimality (Average): $f_A(\Sigma) = \frac{1}{p}\,\mathrm{tr}(\Sigma^{-1})$;
- D-optimality (Determinant): $f_D(\Sigma) = (\det \Sigma)^{-1/p}$;
- T-optimality (Trace): $f_T(\Sigma) = p/\mathrm{tr}(\Sigma)$;
- E-optimality (Eigenvalue): $f_E(\Sigma) = \|\Sigma^{-1}\|_2$;
- V-optimality (Variance): $f_V(\Sigma) = \frac{1}{n}\,\mathrm{tr}(X\Sigma^{-1}X^\top)$;
- G-optimality: $f_G(\Sigma) = \max \mathrm{diag}(X\Sigma^{-1}X^\top)$.

The (A-, D-, T-, E-) criteria concern estimates of the regression coefficients, and the (V-, G-) criteria concern in-sample predictions. All criteria listed above are regular. Note that for D-optimality it is the proxy function $g_D(\Sigma) = -\log\det(\Sigma)$ that satisfies the convexity property. In addition, by the standard arithmetic-mean inequalities we have $f_T \le f_D \le f_A \le f_E$ and $f_V \le f_G$.

Although exact optimization of the combinatorial problem in Eq. (2) is intractable, it is nevertheless easy to solve a continuous relaxation of Eq. (2), thanks to the convexity property in Definition 2.1. We consider the following continuous optimization problem:

$$\pi^*(b) = \arg\min_{\pi = (\pi_1,\dots,\pi_n)} f\Big(\sum_{i=1}^n \pi_i x_i x_i^\top\Big) \quad \text{s.t.}\quad \pi \ge 0,\ \ \|\pi\|_1 \le r,\ \ \mathbb{I}[b=2]\cdot\pi \le 1. \tag{6}$$

The $\|\pi\|_1 \le r$ constraint makes sure that only $r$ rows of $X$ are (effectively) selected, where $r \le k$ is a parameter that controls the degree of over-sampling. The $0 \le \pi_i \le 1$ constraint enforces that each row of $X$ is selected at most once and is only applicable to the without replacement setting ($b = 2$). Eq. (6) is a relaxation of the original combinatorial problem in Eq. (2), which we formalize below:

**Fact 2.1.** For $b \in \{1,2\}$ we have $f\big(\sum_{i=1}^n \pi_i^*(b)\, x_i x_i^\top\big) \le \min_{S \in \mathcal{S}_b(n,r)} f(X_S^\top X_S)$.

In addition, because of the monotonicity property of $f$, the sum constraint must bind:

**Fact 2.2.** For $b \in \{1,2\}$ it holds that $\sum_{i=1}^n \pi_i^*(b) = r$.

Proofs of Facts 2.1 and 2.2 are straightforward and are placed in the supplementary material.
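For concreteness, the six criteria listed above can be evaluated directly from their definitions. The following NumPy sketch is illustrative only (the function names are ours, not from the paper) and assumes $\Sigma$ is symmetric positive definite.

```python
import numpy as np

def crit_A(Sigma):      # A-optimality: (1/p) * tr(Sigma^{-1})
    return np.trace(np.linalg.inv(Sigma)) / Sigma.shape[0]

def crit_D(Sigma):      # D-optimality: (det Sigma)^{-1/p}, via slogdet for stability
    _, logdet = np.linalg.slogdet(Sigma)
    return np.exp(-logdet / Sigma.shape[0])

def crit_T(Sigma):      # T-optimality: p / tr(Sigma)
    return Sigma.shape[0] / np.trace(Sigma)

def crit_E(Sigma):      # E-optimality: ||Sigma^{-1}||_2 = 1 / lambda_min(Sigma)
    return 1.0 / np.linalg.eigvalsh(Sigma)[0]

def crit_V(Sigma, X):   # V-optimality: (1/n) * tr(X Sigma^{-1} X^T)
    return np.trace(X @ np.linalg.inv(Sigma) @ X.T) / X.shape[0]

def crit_G(Sigma, X):   # G-optimality: max of diag(X Sigma^{-1} X^T)
    return np.max(np.einsum('ij,jk,ik->i', X, np.linalg.inv(Sigma), X))
```

The regularity properties can be spot-checked numerically, e.g. `crit_A(2 * Sigma)` should be close to `crit_A(Sigma) / 2` (reciprocal multiplicity).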
Both the objective function and the constraint set in Eq. (6) are convex, and hence Eq. (6) can be efficiently solved to global optimality by conventional convex optimization algorithms. In particular, for differentiable $f$ we suggest the following projected gradient descent (PGD) procedure:

$$\pi^{(t+1)} = P_{\mathcal{C}}\big(\pi^{(t)} - \gamma_t \nabla f(\pi^{(t)})\big), \tag{7}$$

where $P_{\mathcal{C}}(x) = \arg\min_{y\in\mathcal{C}} \|x-y\|_2$ is the projection operator onto the feasible set $\mathcal{C} = \{\pi \in \mathbb{R}^n : \pi \ge 0,\ \|\pi\|_1 \le r,\ \mathbb{I}[b=2]\cdot\pi \le 1\}$ and $\{\gamma_t\}_{t\ge 1} > 0$ is a sequence of step sizes, typically chosen by backtracking line search. When $f$ is not differentiable everywhere, projected subgradient descent can be used with either constant or diminishing step sizes. We defer detailed gradient computations to the supplementary material. It was shown in (Wang et al., 2016; Su et al., 2012) that the projection operator $P_{\mathcal{C}}(x)$ can be computed up to precision $\delta$ in $O(n\log(\|x\|/\delta))$ operations.

## 3. Sparsification via regret minimization

The optimal solution $\pi^*$ of Eq. (6) does not immediately yield a valid approximation to the combinatorial problem in Eq. (2), because the number of non-zero components in $\pi^*$ may far exceed $k$. The primary focus of this section is to design efficient algorithms that sparsify the optimal solution $\pi^*$ into $s \in \{0,1,\dots,k\}^n$ (with replacement) or $s \in \{0,1\}^n$ (without replacement), while at the same time bounding the increase in the objective. Due to the monotonicity and reciprocal multiplicity properties of $f$, it suffices to find a sparsifier $s$ that satisfies

$$\sum_{i=1}^n s_i x_i x_i^\top \ \succeq\ \tau \sum_{i=1}^n \pi_i^* x_i x_i^\top \tag{8}$$

for some constant $\tau \in (0,1)$. By Definition 2.1, Eq. (8) immediately implies $f\big(\sum_{i=1}^n s_i x_i x_i^\top\big) \le \tau^{-1} f\big(\sum_{i=1}^n \pi_i^* x_i x_i^\top\big)$.

The key idea behind our algorithm is a regret-minimization interpretation of the least eigenvalue of a positive definite matrix, which arises from recent progress in the spectral graph sparsification literature (Silva et al., 2016; Allen-Zhu et al., 2015). In the rest of this section, we adopt the notation $\Pi = \mathrm{diag}(\pi^*)$ and $S = \mathrm{diag}(s)$, both being $n\times n$ non-negative diagonal matrices. We also use $I$ to denote the identity matrix, whose dimension should be clear from the context.

### 3.1. The whitening trick

Consider the linear transform $x_i \mapsto (X^\top\Pi X)^{-1/2} x_i =: \tilde x_i$. It is easy to verify that $\sum_{i=1}^n \pi_i^* \tilde x_i \tilde x_i^\top = I$. Such a transform is usually referred to as whitening, because the sample covariance of the transformed data is the identity matrix. Define $W = \sum_{i=1}^n s_i \tilde x_i \tilde x_i^\top$. We then have the following:

**Proposition 3.1.** For $\tau > 0$, $W \succeq \tau I$ if and only if $\sum_{i=1}^n s_i x_i x_i^\top \succeq \tau \sum_{i=1}^n \pi_i^* x_i x_i^\top$.

*Proof.* The proposition holds because $W \succeq \tau I$ if and only if $(X^\top\Pi X)^{1/2} W (X^\top\Pi X)^{1/2} \succeq \tau X^\top\Pi X$, and $(X^\top\Pi X)^{1/2} W (X^\top\Pi X)^{1/2} = X^\top S X$. ∎

Proposition 3.1 shows that, without loss of generality, we may assume $\sum_{i=1}^n \pi_i^* x_i x_i^\top = X^\top\Pi X = I$. The question of proving $W = X^\top S X \succeq \tau I$ is then reduced to lower bounding the smallest eigenvalue of $W$. Recall that $W$ can be written as a sum of rank-1 PSD matrices, $W = \sum_{t=1}^k F_t$, where $F_t = x_i x_i^\top$ for some $i \in [n]$. In the next section we give a novel characterization of the least eigenvalue of $W$ from a regret minimization perspective. The problem of lower bounding the least eigenvalue of $W$ can then be reduced to bounding the regret of a particular Follow-The-Regularized-Leader (FTRL) algorithm, which is a much easier task because FTRL admits closed-form solutions.
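Before moving on, the whitening step of Sec. 3.1 is easy to sanity-check numerically. The sketch below (variable names are ours, and the uniform weights are illustrative rather than an optimal $\pi^*$) verifies that the transformed rows satisfy $\sum_i \pi_i^* \tilde x_i \tilde x_i^\top = I$.

```python
import numpy as np

def whiten(X, pi):
    """Whitening trick of Sec. 3.1: map each row x_i to (X^T Pi X)^{-1/2} x_i,
    so that the pi-weighted sample covariance of the new rows is the identity."""
    Sigma_pi = X.T @ (pi[:, None] * X)           # X^T diag(pi) X  (p x p, SPD)
    evals, evecs = np.linalg.eigh(Sigma_pi)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return X @ inv_sqrt                          # i-th row is the whitened x_i

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
pi = np.full(50, 8 / 50.0)                       # weights summing to r = 8
Xt = whiten(X, pi)
assert np.allclose(Xt.T @ (pi[:, None] * Xt), np.eye(4), atol=1e-8)
```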
### 3.2. Smallest eigenvalue as regret minimization

We first review the concept of regret minimization in a classical linear bandit setting. Let $\Delta_p = \{A \in \mathbb{R}^{p\times p} : A \succeq 0,\ \mathrm{tr}(A) = 1\}$ be an action space consisting of positive semi-definite matrices of dimension $p$ and unit trace. Consider the linear bandit problem, which operates in $k$ iterations. At iteration $t$, the player chooses an action $A_t \in \Delta_p$; afterwards, a reference action $F_t \succeq 0$ is revealed and the loss $\langle F_t, A_t\rangle$ is incurred. The objective of the player is to minimize his/her regret

$$R(\{A_t\}_{t=1}^k) := \sum_{t=1}^k \langle F_t, A_t\rangle - \inf_{U\in\Delta_p}\sum_{t=1}^k \langle F_t, U\rangle, \tag{9}$$

which is the excess loss of $\{A_t\}_{t=1}^k$ compared to the single optimal action $U \in \Delta_p$ in hindsight, knowing all the reference actions $\{F_t\}_{t=1}^k$.

A popular algorithm for regret minimization is Follow-The-Regularized-Leader (FTRL), also known to be equivalent to Mirror Descent (MD) (McMahan, 2011), which solves

$$A_t = \arg\min_{A\in\Delta_p}\Big\{ w(A) + \alpha\sum_{\ell<t}\langle F_\ell, A\rangle\Big\}.$$

Here $w(A)$ is a regularization term and $\alpha > 0$ is a parameter that balances model fitting and regularization. For our purposes we adopt the $\ell_{1/2}$-regularizer $w(A) = -2\,\mathrm{tr}(A^{1/2})$ introduced in (Allen-Zhu et al., 2015), which leads to the closed-form solution

$$A_t = \Big(\alpha\sum_{\ell<t} F_\ell + c_t I\Big)^{-2}, \tag{10}$$

where $c_t \in \mathbb{R}$ is the unique constant that ensures $A_t \in \Delta_p$. The following lemma from (Allen-Zhu et al., 2015) bounds the regret of FTRL with this particular $\ell_{1/2}$-regularizer:

**Lemma 3.1** (Theorem 3.2 of (Allen-Zhu et al., 2015), specialized to $\ell_{1/2}$-regularization). Suppose $\alpha > 0$, $\mathrm{rank}(F_t) = 1$, and let $\{A_t\}_{t=1}^k$ be the FTRL solutions defined in Eq. (10). If $\alpha\langle F_t, A_t^{1/2}\rangle > -1$ for all $t$, then

$$R(\{A_t\}_{t=1}^k) = \sum_{t=1}^k \langle F_t, A_t\rangle - \inf_{U\in\Delta_p}\sum_{t=1}^k\langle F_t, U\rangle \ \le\ \sum_{t=1}^k \frac{\alpha\,\langle F_t, A_t\rangle\,\langle F_t, A_t^{1/2}\rangle}{1+\alpha\langle F_t, A_t^{1/2}\rangle} + \frac{2\sqrt p}{\alpha}.$$

Now consider each $F_t = x_{i_t}x_{i_t}^\top$ to be the outer product of a design point selected from the design pool $X$. One remarkable consequence of Lemma 3.1 is that, in order to lower bound the smallest eigenvalue of $\sum_{t=1}^k F_t$, which by definition equals $\inf_{U\in\Delta_p}\sum_{t=1}^k\langle F_t, U\rangle$, it suffices to lower bound $\sum_{t=1}^k\langle F_t, A_t\rangle$. Because $A_t$ admits the closed-form expression in Eq. (10), choosing a sequence $\{F_t\}_{t=1}^k$ with large $\sum_{t=1}^k\langle F_t, A_t\rangle$ becomes a much more manageable analytical task, which we formalize in the next section.

### 3.3. Proof of Theorem 1.1

Re-organizing terms in Lemma 3.1, we obtain

$$\inf_{U\in\Delta_p}\sum_{t=1}^k\langle F_t, U\rangle \ \ge\ \sum_{t=1}^k \frac{\langle F_t, A_t\rangle}{1+\alpha\langle F_t, A_t^{1/2}\rangle} - \frac{2\sqrt p}{\alpha}. \tag{11}$$

The $k$ near-optimal design points are selected in a sequential manner. Let $\Lambda_t \in \mathcal{S}_b(n,t)$ be the set of design points selected at or prior to iteration $t$ ($\Lambda_0 = \emptyset$), and define $F_t = x_{i_t}x_{i_t}^\top$, where $i_t$ is the design point selected at iteration $t$. With a slight overload of notation, define also the matrix $\Lambda_t = \sum_{\ell=1}^t F_\ell = \sum_{i\in\Lambda_t} x_i x_i^\top$.

We first consider the with replacement setting $b = 1$.

**Lemma 3.2.** Suppose $\sum_{i=1}^n \pi_i^* x_i x_i^\top = I$ where $\pi_i^* \ge 0$ and $\sum_{i=1}^n \pi_i^* = r$. Then for $1 \le t \le k$ we have
$$\max_{i\in[n]} \frac{\langle x_i x_i^\top, A_t\rangle}{1+\alpha\langle x_i x_i^\top, A_t^{1/2}\rangle} \ \ge\ \frac{1}{r+\alpha\sqrt p}.$$

*Proof.* Recall that $\mathrm{tr}(A_t) = 1$ and $\sum_{i=1}^n \pi_i^* x_i x_i^\top = I$. Subsequently, $\sum_{i=1}^n \pi_i^*\langle x_i x_i^\top, A_t\rangle = 1$. On the other hand,
$$\sum_{i=1}^n \pi_i^*\big(1+\alpha\langle x_i x_i^\top, A_t^{1/2}\rangle\big) = \sum_{i=1}^n \pi_i^* + \alpha\,\mathrm{tr}(A_t^{1/2}) \overset{(a)}{\le} r + \alpha\,\mathrm{tr}(A_t^{1/2}) \overset{(b)}{\le} r + \alpha\sqrt p.$$
Here (a) is due to the optimization constraint $\|\pi\|_1 \le r$, and (b) is because $\mathrm{tr}(A_t^{1/2}) = \|\sigma(A_t^{1/2})\|_1 \le \sqrt p\,\|\sigma(A_t^{1/2})\|_2 = \sqrt p\sqrt{\|\sigma(A_t)\|_1} = \sqrt p\sqrt{\mathrm{tr}(A_t)} = \sqrt p$, where $\sigma(\cdot)$ is the vector of eigenvalues of a PSD matrix. Combining both inequalities we have
$$\max_{i\in[n]} \frac{\langle x_i x_i^\top, A_t\rangle}{1+\alpha\langle x_i x_i^\top, A_t^{1/2}\rangle} \ \ge\ \frac{\sum_{i=1}^n \pi_i^*\langle x_i x_i^\top, A_t\rangle}{\sum_{i=1}^n \pi_i^*\big(1+\alpha\langle x_i x_i^\top, A_t^{1/2}\rangle\big)},$$
where the right-hand side is lower bounded by $1/(r+\alpha\sqrt p)$. ∎

Let $i_t = \arg\max_{i\in[n]} \frac{\langle x_i x_i^\top, A_t\rangle}{1+\alpha\langle x_i x_i^\top, A_t^{1/2}\rangle}$ be the design point selected at iteration $t$. Combining Eq. (11) and Lemma 3.2,

$$\lambda_{\min}\Big(\sum_{i\in\Lambda_k} x_i x_i^\top\Big) \ \ge\ \frac{k}{r+\alpha\sqrt p} - \frac{2\sqrt p}{\alpha}. \tag{12}$$

To prove Eq. (3), set $\alpha = 8\sqrt p/\varepsilon$. Because $k = r \ge C_0 p/\varepsilon^2$, we have $\frac{k}{r+\alpha\sqrt p} - \frac{2\sqrt p}{\alpha} \ge \frac{1}{1+8\varepsilon/C_0} - \frac{\varepsilon}{4}$. With $C_0 = 32$ the right-hand side is lower bounded by $1-\varepsilon/2$. Eq. (3) is thus proved because $(1-\varepsilon/2)^{-1} \le 1+\varepsilon$.
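For readers who want the constants spelled out, the last step is just an expansion of the inequalities already stated above with $\alpha = 8\sqrt p/\varepsilon$ and $C_0 = 32$:

$$\alpha\sqrt p = \frac{8p}{\varepsilon} \le \frac{8\varepsilon r}{C_0} = \frac{\varepsilon r}{4} \quad(\text{since } r \ge C_0\,p/\varepsilon^2), \qquad \frac{2\sqrt p}{\alpha} = \frac{\varepsilon}{4},$$

so that

$$\frac{k}{r+\alpha\sqrt p} - \frac{2\sqrt p}{\alpha} \ \ge\ \frac{r}{r + \varepsilon r/4} - \frac{\varepsilon}{4} = \frac{1}{1+\varepsilon/4} - \frac{\varepsilon}{4} \ \ge\ \Big(1-\frac{\varepsilon}{4}\Big) - \frac{\varepsilon}{4} = 1 - \frac{\varepsilon}{2}.$$

Hence $\sum_{i\in\Lambda_k} x_i x_i^\top \succeq (1-\varepsilon/2)\,I$, and monotonicity together with reciprocal multiplicity of $f$ gives $f(X_{\hat S}^\top X_{\hat S}) \le (1-\varepsilon/2)^{-1} f\big(\sum_i \pi_i^* x_i x_i^\top\big) \le (1+\varepsilon)\min_{S\in\mathcal{S}_1(n,k)} f(X_S^\top X_S)$.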
We next consider the without replacement setting $b = 2$.

**Lemma 3.3.** Fix an arbitrary $\beta \in (0,1]$ and suppose $\sum_{i=1}^n \pi_i^* x_i x_i^\top = I$ where $\pi_i^* \in [0,\beta]$ and $\sum_{i=1}^n \pi_i^* = r$. Then for all $1 \le t \le k$,
$$\max_{i\notin\Lambda_{t-1}} \frac{\langle x_i x_i^\top, A_t\rangle}{1+\alpha\langle x_i x_i^\top, A_t^{1/2}\rangle} \ \ge\ \frac{1-\beta\sigma_{\min}(\Lambda_{t-1}) - \sqrt p/\alpha}{r+\alpha\sqrt p}.$$

*Proof.* On one hand, we have
$$\sum_{i\notin\Lambda_{t-1}}\pi_i^*\langle x_ix_i^\top, A_t\rangle \overset{(a)}{\ge} \langle A_t,\ I-\beta\Lambda_{t-1}\rangle \overset{(b)}{=} 1 - \beta\,\mathrm{tr}\big[(\alpha\Lambda_{t-1}+c_tI)^{-2}\Lambda_{t-1}\big] = 1 + \frac{\beta c_t}{\alpha} - \frac{\beta}{\alpha}\,\mathrm{tr}(A_t^{1/2}) \overset{(c)}{\ge} 1 + \frac{\beta c_t}{\alpha} - \frac{\beta\sqrt p}{\alpha}.$$
Here (a) is due to $\sum_{i=1}^n\pi_i^*x_ix_i^\top = I$ and $\pi_i^*\in[0,\beta]$; (b) is due to $\langle A_t, I\rangle = \mathrm{tr}(A_t) = 1$; and (c) uses $\mathrm{tr}(A_t^{1/2}) \le \sqrt p$, which was proved in the proof of Lemma 3.2. Because $\alpha\Lambda_{t-1}+c_tI \succ 0$, we conclude that $c_t \ge -\alpha\sigma_{\min}(\Lambda_{t-1})$, and therefore $\sum_{i\notin\Lambda_{t-1}}\pi_i^*\langle x_ix_i^\top, A_t\rangle \ge 1 - \beta\sigma_{\min}(\Lambda_{t-1}) - \sqrt p/\alpha$. On the other hand, $\sum_{i\notin\Lambda_{t-1}}\pi_i^*\big(1+\alpha\langle x_ix_i^\top, A_t^{1/2}\rangle\big) \le r + \alpha\sqrt p$ by the same argument as in the proof of Lemma 3.2. Subsequently,
$$\max_{i\notin\Lambda_{t-1}}\frac{\langle x_ix_i^\top, A_t\rangle}{1+\alpha\langle x_ix_i^\top, A_t^{1/2}\rangle} \ \ge\ \frac{\sum_{i\notin\Lambda_{t-1}}\pi_i^*\langle x_ix_i^\top, A_t\rangle}{\sum_{i\notin\Lambda_{t-1}}\pi_i^*\big(1+\alpha\langle x_ix_i^\top, A_t^{1/2}\rangle\big)} \ \ge\ \frac{1-\beta\sigma_{\min}(\Lambda_{t-1})-\sqrt p/\alpha}{r+\alpha\sqrt p}. \qquad ∎$$

Let $i_t = \arg\max_{i\notin\Lambda_{t-1}}\frac{\langle x_ix_i^\top, A_t\rangle}{1+\alpha\langle x_ix_i^\top, A_t^{1/2}\rangle}$. Combining Eq. (11) and Lemma 3.3 with $\beta = 1$, we have

$$\lambda_{\min}\Big(\sum_{i\in\Lambda_k}x_ix_i^\top\Big) \ \ge\ \sum_{t=1}^k \frac{1-\kappa_{t-1}-\sqrt p/\alpha}{r+\alpha\sqrt p} - \frac{2\sqrt p}{\alpha}, \tag{13}$$

where $\kappa_t := \sigma_{\min}(\Lambda_t)$. We are now ready to prove Eqs. (4) and (5) in Theorem 1.1.

*Proof of Eq. (4).* Note that

$$\lambda_{\min}(\Lambda_k) \ \ge\ \sup_{u>0}\ \min\Big\{u,\ \big(1-u-\sqrt p/\alpha\big)\frac{k}{r+\alpha\sqrt p} - \frac{2\sqrt p}{\alpha}\Big\}. \tag{14}$$

Eq. (14) can be proved by a case analysis: if $u \le \kappa_{t-1}$ for some $1\le t\le k$, then $\sigma_{\min}(\Lambda_k) \ge \sigma_{\min}(\Lambda_{t-1}) \ge u$; otherwise $1-\kappa_{t-1}-\sqrt p/\alpha \ge 1-u-\sqrt p/\alpha$ for all $1\le t\le k$ and Eq. (13) applies. Suppose $k = r \ge \xi p$ for some $\xi > 2$, and let $\alpha = \nu\sqrt p$ and $u = \frac{(1-2/\xi)\nu - 3}{\nu(2+\nu/\xi)}$, where $\nu > 1$ is a parameter to be specified later. Eq. (14) then yields $\Lambda_k \succeq \frac{(1-2/\xi)\nu-3}{\nu(2+\nu/\xi)}\, I$. Because $\xi > 2$, it is possible to select $\nu$ such that $C_1(\xi)^{-1} = \frac{(1-2/\xi)\nu-3}{\nu(2+\nu/\xi)} > 0$. Finally, for $\xi \ge 4$ and $\nu = 8$ we have $C_1(\xi)^{-1} \ge 1/32$. Eq. (4) is thus proved. ∎

*Proof of Eq. (5).* Let $\beta \in (0,1)$ be a parameter to be specified later, and define $\Sigma_{\ge\beta} := \sum_{\pi_i^*\ge\beta}\pi_i^* x_ix_i^\top$ and $\Sigma_{<\beta} := \sum_{\pi_i^*<\beta}\pi_i^* x_ix_i^\top$. Let $\hat S$ be constructed so that it includes all points in $S_{\ge\beta} := \{i : \pi_i^*\ge\beta\}$, plus the set obtained by running Algorithm 1 on the remaining weights smaller than $\beta$, with subset size $\tilde k = k - |S_{\ge\beta}|$. Define $\alpha = 2\sqrt p/\varepsilon$, $\bar r := \sum_{\pi_i^*\ge\beta}\pi_i^*$, $\bar k := |S_{\ge\beta}|$ and $\tilde r := r - \bar r + \alpha\sqrt p = r - \bar r + 2p/\varepsilon$. Let $\Lambda = \sum_{i\in\hat S}x_ix_i^\top$ be the sample covariance of the selected subset. By the definition of $\hat S$ and Lemma 3.3, together with the whitening trick (Sec. 3.1) applied to $\Sigma_{<\beta}$, we have
$$\Lambda \ \succeq\ \Sigma_{\ge\beta} + \sup_{u>0}\min\Big\{u,\ \big(1-\beta u-\tfrac{\varepsilon}{2}\big)\tfrac{\tilde k}{\tilde r} - \varepsilon\Big\}\,\Sigma_{<\beta} \ \succeq\ \sup_{u>0}\min\Big\{u,\ \big(1-\beta u-\tfrac{\varepsilon}{2}\big)\tfrac{\tilde k}{\tilde r} - \varepsilon\Big\}\,I,$$
where the second inequality holds because $\Sigma_{\ge\beta} + \Sigma_{<\beta} = I$ and $u \le 1$. Now set $\beta = 0.5$ and note that $\bar k \le \bar r/\beta \le 2r$ by definition of $S_{\ge\beta}$. Subsequently, $r \ge p/\varepsilon^2$ and $k \ge 4(1+7\varepsilon)r$ for $\varepsilon \in (0,1/2)$ imply that $\tilde k/\tilde r \ge \frac{1+2\varepsilon}{(1-\varepsilon/2)(1-\beta)}$, which yields $\Lambda \succeq (1-\varepsilon/2)\,I$ and hence $f(X_{\hat S}^\top X_{\hat S}) \le (1+\varepsilon)\min_{S\in\mathcal{S}_2(n,r)}f(X_S^\top X_S)$. Eq. (5) is thus proved. ∎

**Algorithm 1** Near-optimal experimental design

1. **Input:** design pool $X\in\mathbb{R}^{n\times p}$, budget parameters $k \ge r \ge p$, algorithmic parameter $\alpha > 0$.
2. Solve the convex optimization problem in Eq. (6) with parameter $r$; let $\pi^*$ be the optimal solution.
3. Whitening: $X \leftarrow X\,(X^\top\mathrm{diag}(\pi^*)X)^{-1/2}$.
4. Initialization: $\Lambda_0 = \emptyset$.
5. For $t = 1$ to $k$:
6. $\quad c_t \leftarrow \textsc{FindConstant}\big(\sum_{i\in\Lambda_{t-1}}x_ix_i^\top,\ \alpha\big)$;
7. $\quad A_t \leftarrow \big(c_tI + \alpha\sum_{i\in\Lambda_{t-1}}x_ix_i^\top\big)^{-2}$;
8. $\quad$ if $b = 1$ then $\Gamma_t = [n]$; else $\Gamma_t = [n]\setminus\Lambda_{t-1}$;
9. $\quad i_t \leftarrow \arg\max_{i\in\Gamma_t}\ \frac{\langle x_ix_i^\top, A_t\rangle}{1+\alpha\langle x_ix_i^\top, A_t^{1/2}\rangle}$;
10. $\quad \Lambda_t = \Lambda_{t-1}\cup\{i_t\}$.
11. End for.
12. **Output:** $\hat S = \Lambda_k$.

**Algorithm 2** $\textsc{FindConstant}(Z, \alpha)$

1. Initialization: $c_\ell = -\alpha\,\sigma_{\min}(Z)$, $c_u = \sqrt p$; $\epsilon = 10^{-9}$.
2. While $|c_\ell - c_u| > \epsilon$:
3. $\quad c \leftarrow (c_\ell + c_u)/2$;
4. $\quad$ if $\mathrm{tr}\big[(cI + \alpha Z)^{-2}\big] > 1$ then $c_\ell \leftarrow c$; else $c_u \leftarrow c$.
5. End while.
6. **Output:** $c = (c_\ell + c_u)/2$.

Our proof of Theorem 1.1 is constructive and yields a computationally efficient iterative procedure, given as pseudocode in Algorithm 1, which finds a subset $\hat S \in \mathcal{S}_b(n,k)$ satisfying the approximation guarantees of Theorem 1.1. Algorithm 1 makes use of a binary search routine (Algorithm 2) that finds the unique constant $c_t$ for which $\mathrm{tr}(A_t) = \mathrm{tr}\big[(c_tI + \alpha\sum_{i\in\Lambda_{t-1}}x_ix_i^\top)^{-2}\big] = 1$. Note that for Eq. (5) to be valid, it is necessary to run Algorithm 1 on the remaining weights of $\pi^*$ after including in $\hat S$ all points $x_i$ with $\pi_i^* \ge 1/2$.
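For readers who prefer running code over pseudocode, the following NumPy sketch mirrors Algorithms 1 and 2. It is our illustrative rendering, not the authors' released implementation: the continuous step (line 2 of Algorithm 1) is assumed to be solved elsewhere and passed in as `pi_star`, and the placement of $\alpha$ follows Eq. (10).

```python
import numpy as np

def find_constant(Lam, alpha, tol=1e-9):
    """Algorithm 2: bisection for the unique c with tr[(c*I + alpha*Lam)^{-2}] = 1."""
    p = Lam.shape[0]
    evals = np.linalg.eigvalsh(Lam)                  # eigenvalues of Lam, ascending
    c_lo, c_hi = -alpha * evals[0], np.sqrt(p)
    trace_val = lambda c: np.sum(1.0 / (c + alpha * evals) ** 2)
    while c_hi - c_lo > tol:
        c = 0.5 * (c_lo + c_hi)
        if trace_val(c) > 1.0:
            c_lo = c
        else:
            c_hi = c
    return 0.5 * (c_lo + c_hi)

def near_optimal_design(X, pi_star, k, alpha, with_replacement=False):
    """Algorithm 1: greedy FTRL-based selection of k rows after whitening by pi_star."""
    n, p = X.shape
    # Whitening (line 3): X <- X (X^T diag(pi*) X)^{-1/2}
    Sig = X.T @ (pi_star[:, None] * X)
    w, V = np.linalg.eigh(Sig)
    Xw = X @ (V @ np.diag(w ** -0.5) @ V.T)
    selected, Lam = [], np.zeros((p, p))             # Lam = sum of outer products chosen so far
    for _ in range(k):
        c_t = find_constant(Lam, alpha)
        A_half = np.linalg.inv(c_t * np.eye(p) + alpha * Lam)   # A_t^{1/2} = (c_t I + alpha*Lam)^{-1}
        A_t = A_half @ A_half                                    # A_t = (c_t I + alpha*Lam)^{-2}
        num = np.einsum('ij,jk,ik->i', Xw, A_t, Xw)              # <x_i x_i^T, A_t>
        den = 1.0 + alpha * np.einsum('ij,jk,ik->i', Xw, A_half, Xw)
        score = num / den
        if not with_replacement:
            score[selected] = -np.inf                # restrict to Gamma_t = [n] \ Lambda_{t-1}
        i_t = int(np.argmax(score))
        selected.append(i_t)
        Lam += np.outer(Xw[i_t], Xw[i_t])
    return selected
```

For the variant behind Eq. (5), one would first include every index with $\pi_i^* \ge 1/2$ and run this routine only on the remaining weights, as noted above.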
## 4. Extension to generalized linear models

The experimental design algorithm presented in this paper can easily be extended beyond the linear regression model. For this purpose we consider the Generalized Linear Model (GLM), which assumes that $y|x \overset{\text{i.i.d.}}{\sim} p(y|x^\top\beta_0)$, where $p(\cdot|\cdot)$ is a known distribution and $\beta_0$ is an unknown $p$-dimensional regression model. Examples include the logistic regression model $p(y=1|x) = \frac{\exp(x^\top\beta_0)}{1+\exp(x^\top\beta_0)}$, the Poisson count model $p(y_i = y|x) = \frac{\exp(yx^\top\beta_0 - e^{x^\top\beta_0})}{y!}$, and many others.

Let $S \in \mathcal{S}_b(n,k)$ be the set of selected design points from $X$. Under the classical statistics regime, the maximum likelihood (ML) estimator $\hat\beta_{\mathrm{ML}} = \arg\min_\beta -\sum_{i\in S}\log p(y_i|x_i^\top\beta)$ is asymptotically efficient, and its asymptotic covariance equals the inverse of the Fisher information

$$I(X_S;\beta_0) := \sum_{i\in S}\mathbb{E}_{y|x_i^\top\beta_0}\Big[-\frac{\partial^2\log p(y|x_i^\top\beta_0)}{\partial\beta\,\partial\beta^\top}\Big] = \sum_{i\in S}\mathbb{E}\Big[-\frac{\partial^2\log p(y|\eta_i)}{\partial\eta_i^2}\Big]\, x_i x_i^\top, \qquad \eta_i = x_i^\top\beta_0.$$

Here the second equality is due to the sufficiency of $x_i^\top\beta_0$ in a GLM. Note that for the linear regression model $y = X\beta_0 + w$, the ML estimator is the ordinary least squares (OLS) estimator $\hat\beta = (X_S^\top X_S)^{-1}X_S^\top y_S$, and its Fisher information equals the sample covariance $X_S^\top X_S$. The experimental design problem can then be formalized as follows:²

$$\min_{S\in\mathcal{S}_b(n,k)} f\big(I(X_S;\beta_0)\big) = \min_{S\in\mathcal{S}_b(n,k)} f\Big(\sum_{i\in S}\mathbb{E}\Big[-\frac{\partial^2\log p(y_i|\eta_i)}{\partial\eta_i^2}\Big]x_ix_i^\top\Big), \qquad \eta_i = x_i^\top\beta_0. \tag{15}$$

Suppose $\check\beta$ is a pilot estimate of $\beta_0$, obtained from a uniformly sampled design subset $S_1$. A near-optimal design set $S_2$ can then be constructed by minimizing Eq. (15) with $\check\eta_i = x_i^\top\check\beta$ in place of $\eta_i$. Such an approach has been adopted in sequential design and active learning for ML estimators (Chaudhuri et al., 2015; Khuri et al., 2006); with our algorithm, however, the quality of $S_2$ is greatly improved.

²Under very mild conditions, $\mathbb{E}\big[-\frac{\partial^2\log p}{\partial\eta^2}\big] = \mathbb{E}\big[\big(\frac{\partial\log p}{\partial\eta}\big)^2\big]$ is non-negative (Van der Vaart, 2000).
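As a concrete instance of Eq. (15), for logistic regression the per-point Fisher weight is the Bernoulli variance $\mu(\check\eta_i)\,(1-\mu(\check\eta_i))$ evaluated at the pilot estimate, so the GLM design problem reduces to a linear design problem on re-weighted rows. The sketch below is ours: it assumes a pilot estimate `beta_pilot` and the `near_optimal_design` routine sketched earlier.

```python
import numpy as np

def logistic_fisher_rows(X, beta_pilot):
    """Scale each row x_i by sqrt(mu_i * (1 - mu_i)) at the pilot estimate, so that
    summing the scaled outer products over S reproduces the Fisher information of Eq. (15)."""
    eta = X @ beta_pilot
    mu = 1.0 / (1.0 + np.exp(-eta))
    return X * np.sqrt(mu * (1.0 - mu))[:, None]

# GLM design = linear design on the Fisher-weighted rows (pilot fit on a uniform subset S1):
# X_f = logistic_fisher_rows(X, beta_pilot)
# pi_star = <solve the relaxation Eq. (6) on X_f>
# S2 = near_optimal_design(X_f, pi_star, k, alpha=10.0)
```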
## 5. Numerical results

We compare the proposed method with several baseline methods on both synthetic and real-world data sets. We only consider the harder without replacement setting, where each row of $X$ can be selected at most once.

### 5.1. Methods and their implementation

We compare our algorithm with three simple heuristic methods that apply to all optimality criteria:

1. *Uniform sampling*: $\hat S$ is sampled uniformly at random without replacement from the design pool $X$.
2. *Weighted sampling*: first the optimal solution $\pi^*$ of Eq. (6) is computed with $r = k$; afterwards, $\hat S$ is sampled without replacement according to the distribution specified by $\pi^*/k$.³ Recall that Wang et al. (2016) proved that weighted sampling works when $k$ is sufficiently large compared to $p$ (cf. Table 1).
3. *Fedorov's exchange* (Miller & Nguyen, 1994): the algorithm starts with a random subset $S_0 \in \mathcal{S}_b(n,k)$ and iteratively exchanges two coordinates $i \in S_0$, $j \notin S_0$ such that the objective is minimized after the exchange. The algorithm terminates when no such exchange can reduce the objective, or after $T$ iterations.

³Fact 2.2 ensures that $\pi^*/k$ is a valid probability distribution.

All algorithms are implemented in MATLAB, except for Fedorov's exchange, which is implemented in C for efficiency reasons. We also apply the Sherman-Morrison formula $(A+\lambda uu^\top)^{-1} = A^{-1} - \frac{\lambda A^{-1}uu^\top A^{-1}}{1+\lambda u^\top A^{-1}u}$ and the matrix determinant lemma $\det(A+\lambda uu^\top) = (1+\lambda u^\top A^{-1}u)\det(A)$ to accelerate computations of rank-1 updates of matrix inverses and determinants. For uniform sampling and weighted sampling, we report the median objective over 50 independent trials. We report only one trial of Fedorov's exchange method due to time constraints. The maximum number of iterations for Fedorov's exchange is set to $T = 100$. We always set $k = r$ in the optimization problem of Eq. (6); details of solving Eq. (6) are placed in the appendix. In Algorithm 1 we set $\alpha = 10$; our simulations suggest that the algorithm is not sensitive to $\alpha$.

### 5.2. Synthetic data

We synthesize a $1000\times 50$ design pool $X$ as follows:
$$X = \begin{pmatrix} X_A & 0_{500\times 25} \\ 0_{500\times 25} & X_B\end{pmatrix},$$
where $X_A$ is a $500\times 25$ random Gaussian matrix, re-scaled so that the eigenvalues of $X_A^\top X_A$ satisfy a quadratic decay $\sigma_j(X_A^\top X_A) \asymp j^{-2}$, and $X_B$ is a $500\times 25$ Gaussian matrix with i.i.d. standard normal entries. Both $X_A$ and $X_B$ have comparable Frobenius norms.

In Table 2 we report results on all six optimality criteria ($f_A, f_D, f_T, f_E, f_V, f_G$) for $k \in \{2p, 3p, 5p, 10p\}$. We also report the running time (measured in seconds) of Algorithm 1 and of the Fedorov exchange algorithm. The other two sampling based algorithms are very efficient and always terminate within one second.

*Table 2.* Simulation results on synthetic data of size $n = 1000$ and $p = 50$. Uniform sampling and weighted sampling are run for 50 independent trials and the median objective is reported. "Inf" means the sample covariance $X_S^\top X_S$ does not belong to $\mathbb{S}_p^+$.

**k = 2p = 100**

| Method | $f_A$ | $f_D$ | $f_T$ | $f_E$ | $f_V$ | $f_G$ |
|---|---|---|---|---|---|---|
| Uniform sampling | 34.29 | 7.25 | 2.05 | 349.4 | 101.4 | 381.4 |
| Weighted sampling | 23.42 | 4.57 | Inf | Inf | 60.22 | 202.6 |
| Fedorov's exchange | 23.17 | 5.52 | 1.15 | 172.9 | 44.43 | 117.7 |
| (running time /secs) | 4.6 | 26 | 2442 | 28 | 488 | 8893 |
| Algorithm 1 | 12.55 | 4.72 | 1.19 | 53.52 | 50.47 | 90.77 |
| (running time /secs) | < 1 | < 1 | < 1 | < 1 | < 1 | < 1 |

**k = 3p = 150**

| Method | $f_A$ | $f_D$ | $f_T$ | $f_E$ | $f_V$ | $f_G$ |
|---|---|---|---|---|---|---|
| Uniform sampling | 24.61 | 6.40 | 2.03 | 196.2 | 73.7 | 219.1 |
| Weighted sampling | 11.18 | 4.26 | 0.96 | Inf | 46.20 | 119.5 |
| Fedorov's exchange | 12.26 | 4.65 | 1.22 | 173.7 | 73.97 | 101.8 |
| (running time /secs) | 296 | 282 | 311 | 360 | < 1 | 11478 |
| Algorithm 1 | 11.90 | 4.60 | 1.27 | 41.53 | 45.97 | 80.94 |
| (running time /secs) | < 2 | < 2 | < 2 | < 2 | < 2 | < 2 |

**k = 5p = 250**

| Method | $f_A$ | $f_D$ | $f_T$ | $f_E$ | $f_V$ | $f_G$ |
|---|---|---|---|---|---|---|
| Uniform sampling | 20.02 | 5.82 | 2.00 | 137.1 | 60.2 | 155.2 |
| Weighted sampling | 10.36 | 4.23 | 1.14 | Inf | 41.91 | 90.61 |
| Fedorov's exchange | 11.70 | 5.84 | 1.38 | 116.1 | 53.14 | 133.67 |
| (running time /secs) | 441 | < 1 | 352 | 2552 | 196 | 1152 |
| Algorithm 1 | 11.14 | 4.67 | 1.38 | 36.67 | 45.6 | 76.20 |
| (running time /secs) | < 2 | < 2 | < 2 | < 2 | < 2 | < 2 |

**k = 10p = 500**

| Method | $f_A$ | $f_D$ | $f_T$ | $f_E$ | $f_V$ | $f_G$ |
|---|---|---|---|---|---|---|
| Uniform sampling | 17.57 | 5.51 | 2.02 | 103.9 | 52.93 | 123.5 |
| Weighted sampling | 11.22 | 4.53 | 1.44 | 52.75 | 43.04 | 80.74 |
| Fedorov's exchange | 12.13 | 5.52 | 1.65 | 108.4 | 45.05 | 99.07 |
| (running time /secs) | 100 | < 1 | 575 | < 1 | 1515 | 26804 |
| Algorithm 1 | 11.60 | 4.77 | 1.56 | 49.27 | 45.14 | 81.78 |
| (running time /secs) | < 5 | < 5 | < 5 | < 5 | < 5 | < 5 |

We observe that our algorithm has the best performance for $f_E$ and $f_G$, while still achieving comparable results for the other optimality criteria. It is also robust when $k$ is small compared to $p$, while sampling based methods occasionally produce designs that are not even of full rank. Finally, Algorithm 1 is computationally efficient and terminates within seconds in all settings.
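For reproducibility, one plausible rendering of the synthetic pool described above is sketched below. The exact re-scaling used by the authors is not specified in the paper, so the spectrum shaping and the Frobenius-norm matching here are our assumptions.

```python
import numpy as np

def synthetic_pool(n=1000, p=50, seed=0):
    """Block-diagonal pool of Sec. 5.2: X_A with quadratically decaying spectrum
    of X_A^T X_A, X_B with i.i.d. standard normal entries, comparable Frobenius norms."""
    rng = np.random.default_rng(seed)
    XA = rng.standard_normal((n // 2, p // 2))
    U, _, Vt = np.linalg.svd(XA, full_matrices=False)
    sing = np.arange(1, p // 2 + 1, dtype=float) ** -1.0   # singular values ~ j^{-1}
    XA = U @ np.diag(sing) @ Vt                            # so eigs of XA^T XA ~ j^{-2}
    XB = rng.standard_normal((n // 2, p // 2))
    XB *= np.linalg.norm(XA) / np.linalg.norm(XB)          # make Frobenius norms comparable
    X = np.zeros((n, p))
    X[: n // 2, : p // 2] = XA
    X[n // 2 :, p // 2 :] = XB
    return X
```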
### 5.3. The Minnesota wind speed dataset

The Minnesota wind dataset collects wind speed information across $n = 2642$ locations in Minnesota, USA over a period of 24 months (for the purpose of this experiment, we only use wind speed data for one month). The 2642 locations are connected by 3304 bi-directional roads, which form an $n\times n$ sparse unweighted undirected graph $G$. Let $L = \mathrm{diag}(d) - G$ be the $n\times n$ Laplacian of $G$, where $d$ is the vector of node degrees, and let $V \in \mathbb{R}^{n\times p}$ be an orthonormal eigenbasis corresponding to the smallest $p$ eigenvalues of $L$. Chen et al. (2015) show that the relatively smooth wind speed signal $y \in \mathbb{R}^n$ can be well approximated using only $p = 15$ graph Laplacian basis vectors.

In Table 3 we compare the mean-square error (MSE) for prediction on the full design pool $V$: $\mathrm{MSE} = \sqrt{\frac{1}{n}\|y - V\hat\beta\|_2^2}$. Because the objective is prediction based, we only consider the two prediction related criteria, $f_V(\Sigma) = \mathrm{tr}(V\Sigma^{-1}V^\top)$ and $f_G(\Sigma) = \max\mathrm{diag}(V\Sigma^{-1}V^\top)$. The subset size is set to $k = 2p = 30$, which is much smaller than $n = 2642$.

*Table 3.* Results on the Minnesota wind speed dataset ($n = 2642$, $p = 15$, $k = 30$). MSE is defined as $\sqrt{\frac{1}{n}\|y - V\hat\beta\|_2^2}$.

| Method | $f_V$ | MSE | $f_G$ | MSE |
|---|---|---|---|---|
| Uniform sampling | 94.1 | 1.10 | 3093 | 1.34 |
| Weighted sampling | 21.4 | 0.89 | 2451 | 1.13 |
| Fedorov's exchange | 10.0 | 0.86 | 29.2 | 0.78 |
| (running time /secs) | 15 | - | 1857 | - |
| Algorithm 1 | 10.8 | 0.72 | 29.2 | 0.76 |
| (running time /secs) | < 1 | - | < 1 | - |
| Full-sample OLS | - | 0.55 | - | 0.55 |

We observe that Algorithm 1 consistently outperforms the other heuristic methods, and is so efficient that its running time is negligible. It is also interesting that with only $k = 30$ samples, Algorithm 1 already achieves an MSE comparable to that of OLS on the entire $n = 2642$ design pool.
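For completeness, the graph Laplacian basis $V$ used above can be computed along the following lines. This is a SciPy sketch of ours; loading the road graph itself is omitted and the variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def laplacian_basis(adj, p=15):
    """Orthonormal eigenbasis of the p smallest eigenvalues of L = diag(d) - G,
    where `adj` is the sparse symmetric 0/1 adjacency matrix of the road graph."""
    d = np.asarray(adj.sum(axis=1)).ravel()
    L = sp.diags(d) - adj
    # Shift-invert around a tiny negative sigma so L - sigma*I is non-singular
    # (L itself is PSD with a zero eigenvalue); returns the p smallest eigenpairs.
    _, vecs = eigsh(L.tocsc(), k=p, sigma=-1e-6, which='LM')
    return vecs            # n x p design pool V used in Eq. (2)
```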
## 6. Concluding remarks and open questions

We proposed a computationally efficient algorithm that approximately computes optimal solutions of the experimental design problem, with a near-optimal requirement on $k$ (i.e., the number of experiments to choose). In particular, we obtained a constant approximation under the very weak condition $k > 2p$, and a $(1+\varepsilon)$ approximation if replacement or over-sampling is allowed. Our algorithm works for all regular optimality criteria.

An important open question is to achieve a $(1+\varepsilon)$ relative approximation ratio under the proper sampling regime $k = r$, or the slight over-sampling regime $k = (1+\delta)r$, for the without replacement model. It was shown in (Wang et al., 2016) that a simple greedy method achieves a $(1+\varepsilon)$ approximation ratio for A- and V-optimality provided that $k = \Omega(p^2/\varepsilon)$. Whether such an analysis can be extended to other optimality criteria, and whether the $p^2$ term can be further reduced to a near-linear function of $p$, remain open. Another practical question is to develop fast-converging optimization methods for the continuous problem in Eq. (6), especially for criteria that are not differentiable, such as E- and G-optimality, where subgradient methods have a very slow convergence rate.

## Acknowledgement

This work is supported by NSF grants CAREER IIS-1252412 and CCF-1563918. We thank Adams Wei Yu for providing an efficient implementation of the projection step, and for other useful discussions.

## References

Ageev, Alexander A and Sviridenko, Maxim I. Pipage rounding: A new method of constructing algorithms with proven performance guarantee. *Journal of Combinatorial Optimization*, 8(3):307-328, 2004.

Allen-Zhu, Zeyuan, Liao, Zhenyu, and Orecchia, Lorenzo. Spectral sparsification and regret minimization beyond matrix multiplicative updates. In *Proceedings of the Annual Symposium on the Theory of Computing (STOC)*, 2015.

Avron, Haim and Boutsidis, Christos. Faster subset selection for matrices and applications. *SIAM Journal on Matrix Analysis and Applications*, 34(4):1464-1499, 2013.

Balcan, Maria-Florina and Long, Philip M. Active and passive learning of linear separators under log-concave distributions. In *Proceedings of the Annual Conference on Learning Theory (COLT)*, 2013.

Bouhtou, Mustapha, Gaubert, Stephane, and Sagnol, Guillaume. Submodularity and randomized rounding techniques for optimal experimental design. *Electronic Notes in Discrete Mathematics*, 36:679-686, 2010.

Boutsidis, Christos and Woodruff, David P. Optimal CUR matrix decompositions. In *Proceedings of the Annual Symposium on the Theory of Computing (STOC)*, 2014.

Černý, Michal and Hladík, Milan. Two complexity results on C-optimality in experimental design. *Computational Optimization and Applications*, 51(3):1397-1408, 2012.

Chaudhuri, Kamalika, Kakade, Sham, Netrapalli, Praneeth, and Sanghavi, Sujay. Convergence rates of active learning for maximum likelihood estimation. In *Proceedings of Advances in Neural Information Processing Systems (NIPS)*, 2015.

Chen, Siheng, Varma, Rohan, Singh, Aarti, and Kovačević, Jelena. Signal representations on graphs: Tools and applications. *arXiv preprint arXiv:1512.05406*, 2015.

Çivril, Ali and Magdon-Ismail, Malik. On selecting a maximum volume sub-matrix of a matrix and related problems. *Theoretical Computer Science*, 410(47-49):4801-4811, 2009.

Deshpande, Amit and Rademacher, Luis. Efficient volume sampling for row/column subset selection. In *Proceedings of the Annual Conference on Foundations of Computer Science (FOCS)*, 2010.

Deshpande, Yash and Montanari, Andrea. Linear bandits in high dimension and recommendation systems. In *Proceedings of the Annual Allerton Conference on Communication, Control, and Computing (Allerton)*, 2012.

Dhillon, Paramveer, Lu, Yichao, Foster, Dean P, and Ungar, Lyle. New subsampling algorithms for fast least squares regression. In *Proceedings of Advances in Neural Information Processing Systems (NIPS)*, 2013.

Drineas, Petros and Mahoney, Michael W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. *Journal of Machine Learning Research*, 6(12):2153-2175, 2005.

Drineas, Petros, Mahoney, Michael W, and Muthukrishnan, S. Relative-error CUR matrix decompositions. *SIAM Journal on Matrix Analysis and Applications*, 30(2):844-881, 2008.

Fedorov, Valerii Vadimovich. *Theory of Optimal Experiments*. Elsevier, 1972.

Hazan, Elad and Karnin, Zohar. Hard-margin active linear regression. In *Proceedings of the International Conference on Machine Learning (ICML)*, 2015.

Horel, Thibaut, Ioannidis, Stratis, and Muthukrishnan, S. Budget feasible mechanisms for experimental design. In *Proceedings of the Latin American Symposium on Theoretical Informatics (LATIN)*, 2014.

Huang, Ruitong, Lattimore, Tor, György, András, and Szepesvári, Csaba. Following the leader and fast rates in linear prediction: Curved constraint sets and other regularities. In *Proceedings of Advances in Neural Information Processing Systems (NIPS)*, 2016.

Khuri, Andre, Mukherjee, Bhramar, Sinha, Bikas, and Ghosh, Malay. Design issues for generalized linear models: A review. *Statistical Science*, 21(3):376-399, 2006.

Kulesza, Alex and Taskar, Ben. Determinantal point processes for machine learning. *Foundations and Trends in Machine Learning*, 5(2-3):123-286, 2012.

Li, Chengtao, Jegelka, Stefanie, and Sra, Suvrit. Polynomial time algorithms for dual volume sampling. *arXiv preprint arXiv:1703.02674*, 2017.

Ma, Yifei, Garnett, Roman, and Schneider, Jeff. σ-optimality for active learning on Gaussian random fields. In *Proceedings of Advances in Neural Information Processing Systems (NIPS)*, 2013.
McMahan, H. Brendan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In *Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)*, 2011.

Miller, Alan and Nguyen, Nam-Ky. A Fedorov exchange algorithm for D-optimal design. *Journal of the Royal Statistical Society, Series C (Applied Statistics)*, 43(4):669-677, 1994.

Pilanci, Mert and Wainwright, Martin J. Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. *Journal of Machine Learning Research*, 17(53):1-38, 2016.

Pukelsheim, Friedrich. *Optimal Design of Experiments*. SIAM, 2006.

Raskutti, Garvesh and Mahoney, Michael. A statistical perspective on randomized sketching for ordinary least-squares. *arXiv preprint arXiv:1406.5986*, 2014.

Ravi, Sathya N, Ithapu, Vamsi K, Johnson, Sterling C, and Singh, Vikas. Experimental design on a budget for sparse linear models and applications. In *Proceedings of the International Conference on Machine Learning (ICML)*, 2016.

Silva, Marcel K, Harvey, Nicholas JA, and Sato, Cristiane M. Sparse sums of positive semidefinite matrices. *ACM Transactions on Algorithms*, 12(1):9, 2016.

Spielman, Daniel A and Srivastava, Nikhil. Graph sparsification by effective resistances. *SIAM Journal on Computing*, 40(6):1913-1926, 2011.

Su, Hao, Yu, Adams Wei, and Li, Fei-Fei. Efficient Euclidean projections onto the intersection of norm balls. In *Proceedings of the International Conference on Machine Learning (ICML)*, 2012.

Van der Vaart, Aad W. *Asymptotic Statistics*, volume 3. Cambridge University Press, 2000.

Wang, Shusen and Zhang, Zhihua. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. *Journal of Machine Learning Research*, 14(1):2729-2769, 2013.

Wang, Yining and Singh, Aarti. An empirical comparison of sampling techniques for matrix column subset selection. In *Proceedings of the Annual Allerton Conference on Communication, Control, and Computing (Allerton)*, 2015a.

Wang, Yining and Singh, Aarti. Provably correct active sampling algorithms for matrix column subset selection with missing data. *arXiv preprint arXiv:1505.04343*, 2015b.

Wang, Yining and Singh, Aarti. Noise-adaptive margin-based active learning and lower bounds under Tsybakov noise condition. In *Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)*, 2016.

Wang, Yining, Yu, Adams Wei, and Singh, Aarti. On computationally tractable selection of experiments in regression models. *arXiv preprint arXiv:1601.02068*, 2016.

Woodruff, David P. Sketching as a tool for numerical linear algebra. *Foundations and Trends in Theoretical Computer Science*, 10(1-2):1-157, 2014.