# Robust Structured Estimation with Single-Index Models

Sheng Chen, Arindam Banerjee
Department of Computer Science & Engineering, University of Minnesota-Twin Cities, Minnesota, USA. Correspondence to: Sheng Chen, Arindam Banerjee.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

In this paper, we investigate general single-index models (SIMs) in high dimensions. Based on U-statistics, we propose two types of robust estimators for the recovery of model parameters, which can be viewed as generalizations of several existing algorithms for one-bit compressed sensing (1-bit CS). With minimal assumptions on the noise, statistical guarantees are established for the generalized estimators under suitable conditions, which allow general structures of the underlying parameter. Moreover, the proposed estimator is instantiated in a novel way for SIMs with a monotone transfer function, and the resulting estimator better leverages the monotonicity. Experimental results are provided to support our theoretical analyses.

1. Introduction

In machine learning and statistics, the linear model y = ⟨θ*, x⟩ + ε is widely used to capture the relationship between feature and response, and it has enjoyed overwhelming popularity for a very long time. Here y ∈ ℝ and x ∈ ℝ^p are the observed response and feature/measurement vector, ε is zero-mean noise, and θ* ∈ ℝ^p is the unknown parameter to be estimated. The simplicity of the linear model leads to great interpretability and computational efficiency, which are often favored in practical applications. On the theoretical side, even in the high-dimensional regime where the sample size is smaller than the problem dimension p, strong statistical guarantees have been established under mild assumptions for various estimators, such as the Lasso (Tibshirani, 1996) and the Dantzig selector (Candes & Tao, 2007). Despite these attractive merits, one main drawback of linear models is the stringent assumption of a linear relationship between x and y, which may fail to hold in complicated scenarios. To introduce more flexibility, one option is to consider the general single-index model (SIM) (Ichimura, 1993; Horowitz & Hardle, 1996),

E[y | x] = f(⟨θ*, x⟩),   (1)

where f : ℝ → ℝ is an unknown univariate transfer function (a.k.a. link function). This class of models enjoys rich modeling power in the sense that it encompasses several useful models as special cases, which are briefly described below.

One-bit Compressed Sensing: In one-bit compressed sensing (1-bit CS) (Boufounos & Baraniuk, 2008; Plan & Vershynin, 2013), the response y is restricted to be binary, i.e., y ∈ {+1, −1}, and the range of the transfer function f is [−1, 1]. Given the measurement vector x, one can generate y from the Bernoulli model

(y + 1)/2 ~ Ber( (f(⟨θ*, x⟩) + 1)/2 ).   (2)

In the noiseless case, f(z) = sign(z) and y always reflects the true sign of ⟨θ*, x⟩, while y can be incorrect for other f, whose shape determines the noise level in some way.

Generalized Linear Models: In generalized linear models (GLMs) (McCullagh, 1984), the transfer function is assumed to be monotonically increasing and the conditional distribution of y|x belongs to an exponential family. Different choices of f give rise to different members of the GLM family. If f is the identity function f(z) = z, one has the simple linear model, while the sigmoid function f(z) = 1/(1 + e^(−z)) results in the logistic model for binary classification. In this work, however, we have no access to the exact f other than knowing that it is monotone.

Noise in Monotone Transfer: Instead of specifying the conditional expectation of y as in GLMs, one can directly introduce the noise inside a monotone transfer f to model the randomness of y (Plan et al., 2016),

y = f(⟨θ*, x⟩ + ε).   (3)

In this setting, the transfer function f in (3) is slightly different from the f in (1); the two are related by E[y | ⟨θ*, x⟩ = z] = E_ε[f(z + ε)].
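As a quick illustration of how such data arise, the following is a minimal simulation sketch (our own, not from the paper) of the Bernoulli sampling in (2) and the noisy monotone transfer in (3); the dimension, sparsity, and the particular choices of f are arbitrary assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, s = 100, 500, 5

# A unit-norm, s-sparse ground-truth parameter (arbitrary choice for illustration).
theta_star = np.zeros(p)
theta_star[:s] = 1.0
theta_star /= np.linalg.norm(theta_star)

X = rng.standard_normal((n, p))   # Gaussian measurements x_i ~ N(0, I)
z = X @ theta_star                # linear measurements <theta*, x_i>

# 1-bit CS, model (2): y in {-1, +1} with P(y = 1 | x) = (f(<theta*, x>) + 1) / 2.
f_cs = np.tanh                    # any transfer with range inside [-1, 1]; tanh is one example
y_cs = 2 * rng.binomial(1, (f_cs(z) + 1) / 2) - 1

# Noise inside a monotone transfer, model (3): y = f(<theta*, x> + eps).
eps = 0.3 * rng.standard_normal(n)
y_mono = (z + eps) ** 3           # f(t) = t^3: monotone, unbounded, non-Lipschitz
```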
A key advantage of SIMs is their robustness. First, allowing an unknown f prevents mis-specification of the transfer function, which could otherwise lead to a poor estimate of θ*. Second, the model in (1) makes minimal assumptions on the distribution of y and is thus able to tolerate potentially heavy-tailed noise. In order to estimate θ*, we are given n measurements of (x, y) ∈ ℝ^p × ℝ, denoted by {(x_i, y_i)}_{i=1}^n. In this work, we focus on the n < p regime. In such a high-dimensional setting, the recovery of θ* is quite challenging, as the problem is ill-posed even when f is given. Over the last decade, substantial progress has been made on this challenge by exploiting a priori structure of the parameter θ*, such as sparsity (Tibshirani, 1996). For simple linear models or GLMs with known transfer, extensive studies have shown that a sparse θ* can be consistently estimated under mild assumptions, with sample complexity much lower than p (Candes & Tao, 2007; Wainwright, 2009; Bickel et al., 2009; Kakade et al., 2010; Negahban et al., 2012; Yang et al., 2016). Recently the notion of structure has been suitably generalized beyond unstructured sparsity (Bach et al., 2012), and Gaussian width (Gordon, 1985) has emerged as a useful measure of structural complexity, which in turn determines the recovery guarantee for θ* (Chandrasekaran et al., 2012; Rao et al., 2012; Oymak et al., 2013; Amelunxen et al., 2014; Banerjee et al., 2014; Chatterjee et al., 2014; Vershynin, 2015; Tropp, 2015; Chen & Banerjee, 2016).

In the absence of an exact f, although 1-bit CS and related variants have been well studied in recent years (Boufounos & Baraniuk, 2008; Jacques et al., 2013; Plan & Vershynin, 2013; Gopi et al., 2013; Zhang et al., 2014; Chen & Banerjee, 2015a; Zhu & Gu, 2015; Yi et al., 2015; Slawski & Li, 2015; Li, 2016; Slawski & Li, 2016), the exploration of general SIMs, or of the case of monotone transfers, is relatively limited, especially in the high-dimensional regime. Kalai & Sastry (2009) and Kakade et al. (2011) investigated low-dimensional SIMs with monotone transfers and proposed perceptron-type algorithms that estimate both f and θ*, with provable guarantees on prediction error. In high dimensions, general SIMs were studied by Alquier & Biau (2013) and Radchenko (2015), where only unstructured sparsity of θ* is considered. The algorithm developed in Alquier & Biau (2013) relies on reversible-jump MCMC, which can be slow. In Radchenko (2015), a path-fitting algorithm is designed to recover f and θ*, but only asymptotic guarantees are provided. Ganti et al. (2015) considered the high-dimensional setting with monotone transfer, and their iterative algorithm is based on non-convex optimization, for which convergence is hard to establish. Moreover, the prediction error bound they derive is weak, in the sense that it is even worse than that of the initialization of their algorithm.
Recently Oymak & Soltanolkotabi (2016) proposed a constrained least-squares method to estimate θ*, with recovery error characterized by Gaussian width and related quantities. Although their analysis considers general structure of θ*, it holds only for the noiseless setting where y = f(⟨θ*, x⟩). General structure of θ* was also explored in Vershynin (2015) and Plan et al. (2016). Other types of statistical guarantees for high-dimensional SIMs are also available, such as support recovery of θ* in Neykov et al. (2016). It is worth noting that all of the aforementioned statistical analyses rely on sub-Gaussian noise or on the transfer function being bounded or Lipschitz, which indicates that none of the results immediately hold for heavy-tailed noise (or without Lipschitzness and boundedness).

In this paper, we focus on the parameter estimation of θ* rather than the prediction of y given a new x. In particular, we propose two families of generalized estimators, constrained and regularized, for model (1) under Gaussian measurements. The parameter θ* is assumed to possess a certain low-complexity structure, which can be captured either by a constraint θ* ∈ K or by a norm regularization term ‖θ*‖. Our general approach is inspired by U-statistics and by advances in 1-bit CS, and it subsumes several existing 1-bit CS algorithms as special cases. Like those algorithms, our estimator is simple and often admits closed-form solutions. Regarding the recovery analysis, there are two appealing aspects. First, our results work for general structure, with error bounds characterized by Gaussian width and other easy-to-compute geometric measures. Instantiating our results with specific structure of θ* recovers previously established error bounds for 1-bit CS (Zhang et al., 2014; Chen & Banerjee, 2015a), which are sharper than those yielded by the general analysis in Plan & Vershynin (2013). Second, our analysis works with limited assumptions on the conditional distribution of y. In particular, our estimator is robust to heavy-tailed noise and permits unbounded as well as non-Lipschitz transfer functions f. At the heart of our analysis is the generic chaining method (Talagrand, 2014), an advanced tool in probability theory, which has been successfully applied to sparse recovery (Koltchinskii, 2011) and dimensionality reduction (Dirksen, 2016), among others. Another key ingredient in our proof is a Hoeffding-type concentration inequality for U-statistics (Lee, 1990) with sub-Gaussian tails, which is less well known yet generalizes the popular one for bounded U-statistics (Hoeffding, 1963). Apart from 1-bit CS, we particularly investigate model (3), for which the generalized estimator is specialized in a novel way. The resulting estimator better leverages the monotonicity of the transfer function, which is also demonstrated through experiments.

For ease of exposition, whenever we say "monotone", we mean monotonically increasing by default. Throughout the paper, we use c, C, C′, C0, C1 and so on to denote absolute constants, which may differ from context to context. Detailed proofs are deferred to the supplementary material due to the page limit.

The rest of the paper is organized as follows. In Section 2, we introduce our estimators for SIMs along with their recovery guarantees. We also provide a few examples in 1-bit CS for illustration. Section 3 focuses on model (3), for which we instantiate the general results in a new way.
Other structures of θ* beyond unstructured sparsity are also discussed. Section 4 provides the proof of our main result and the related lemmas. In Section 5, we complement our theoretical developments with experimental results. The final section is dedicated to conclusions.

2. Generalized Estimation for Structured Parameter

2.1. Assumptions and Preliminaries

For the sake of identifiability, we assume w.l.o.g. that ‖θ*‖₂ = 1 throughout the paper. At first glance of model (1), we may realize that it is difficult to recover θ* due to the unknown f. In contrast, when f is given, recovery guarantees for θ* can be established under mild assumptions on x and y, such as boundedness or sub-Gaussianity. If we know certain properties of the transfer function, like the monotonicity introduced in GLMs and (3), the structure of f is largely restricted, and it is tempting to expect that similar results continue to hold. Unfortunately, we first have the following claim, which indicates that without other constraints on f beyond strict monotonicity, θ* cannot be consistently estimated under general sub-Gaussian (or bounded) measurements, even in the noiseless setting of (3).

Claim 1 Suppose that each element x_i of x is sampled i.i.d. from the Rademacher distribution, i.e., P(x_i = 1) = P(x_i = −1) = 0.5. Under model (3) with noise ε = 0, there exists a θ̃ ∈ S^(p−1) together with a monotone f̃, such that supp(θ̃) = supp(θ*) and y_i = f̃(⟨θ̃, x_i⟩) for data {(x_i, y_i)}_{i=1}^n with arbitrarily large sample size n, while ‖θ̃ − θ*‖₂ > δ for some constant δ.

Now that consistent estimation of θ* is not possible under general sub-Gaussian measurements, it is reasonable to focus on certain special cases. For this work, we assume that x is standard Gaussian, N(0, I). For the SIM (1), we additionally assume that the distribution of y depends on x only through the value of ⟨θ*, x⟩, i.e., the distribution of y|x is fixed once ⟨θ*, x⟩ is given (no matter what the exact x is). This assumption is quite minimal, and it turns out that the examples provided in Section 1 all satisfy it (if the noise ε is independent of x in (3)). The same assumption is used in Plan et al. (2016) as well.

Under the assumptions above, given m i.i.d. observations (x₁, y₁), ..., (x_m, y_m), we define

u((x₁, y₁), ..., (x_m, y_m)) = ∑_{i=1}^m q_i(y₁, ..., y_m) · x_i,   (4)

where all q_i : ℝ^m → ℝ are bounded functions with |q_i| ≤ 1, which are chosen along with m based on the properties of the transfer function. In Sections 2.4 and 3.1, we will see examples of such choices. The vector u ∈ ℝ^p is critical due to the key observation below.

Lemma 1 Suppose the distribution of y in model (1) depends on x only through ⟨θ*, x⟩, and define accordingly

b_i(z₁, ..., z_m; θ*) = E[ q_i(y₁, ..., y_m) | ⟨θ*, x₁⟩ = z₁, ..., ⟨θ*, x_m⟩ = z_m ].   (5)

With x being standard Gaussian N(0, I), the u defined in (4) satisfies

E[ u((x₁, y₁), ..., (x_m, y_m)) ] = β θ*,   (6)

where β = ∑_{i=1}^m E[ b_i(g₁, ..., g_m; θ*) · g_i ], and g₁, ..., g_m are i.i.d. standard Gaussian.

Note that Lemma 1 is true for all choices of q_i; the proof is given in the supplement. This lemma offers an insight for the design of our estimator: the direction of θ* can be recovered if we have a good estimate of E[u]. As we will see in the sequel, the scalar β plays a key role in the estimation error bound, which gives us clues about the choice of q_i. We can assume w.l.o.g. that β ≥ 0, since we can always flip the sign of each q_i.
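To make Lemma 1 concrete, here is a small Monte Carlo sketch (our own illustration, not from the paper) that checks E[u] ∝ θ* for the simplest choice m = 1, q₁(y) = y, which anticipates the 1-bit CS instantiation in Section 2.4; the logistic-type transfer and the sample size below are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(1)
p, N = 50, 200_000

theta_star = rng.standard_normal(p)
theta_star /= np.linalg.norm(theta_star)

# Binary y whose distribution depends on x only through <theta*, x>:
# P(y = +1 | x) = sigmoid(<theta*, x>), so E[y | x] = 2*sigmoid(<theta*, x>) - 1, a transfer with range [-1, 1].
X = rng.standard_normal((N, p))
prob_plus = 1.0 / (1.0 + np.exp(-(X @ theta_star)))
y = 2 * rng.binomial(1, prob_plus) - 1

# m = 1 and q_1(y) = y, so u(x, y) = y * x and, by Lemma 1, E[u] = beta * theta*.
u_mean = (y[:, None] * X).mean(axis=0)

beta_est = u_mean @ theta_star
cosine = beta_est / np.linalg.norm(u_mean)   # should be close to 1: the empirical E[u] aligns with theta*
print(f"estimated beta ~ {beta_est:.3f}, cosine with theta* ~ {cosine:.3f}")
```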
The recovery analysis is built on the notion of Gaussian width (Gordon, 1985), which is defined for any set A ⊆ ℝ^p as w(A) = E[ sup_{v ∈ A} ⟨g, v⟩ ], where g is a standard Gaussian random vector. Roughly speaking, w(A) measures the scaled width of the set A averaged over all directions.

2.2. Generalized Estimator

Inspired by Lemma 1, we define the vector û for the observed data {(x_i, y_i)}_{i=1}^n,

û = ((n − m)!/n!) ∑_{1 ≤ i₁,...,i_m ≤ n, i₁ ≠ ... ≠ i_m} u((x_{i₁}, y_{i₁}), ..., (x_{i_m}, y_{i_m})),   (7)

which is an unbiased estimator of E[u], meaning that E[û] = E[u] = βθ*. When m = 2, we simply have

û = (1/(n(n − 1))) ∑_{1 ≤ i,j ≤ n, i ≠ j} u((x_i, y_i), (x_j, y_j)).   (8)

In fact, û can be treated as a vector version of a U-statistic of order m. Given û, a naive way to estimate θ* is to simply normalize û, i.e., θ̂ = û/‖û‖₂. In the high-dimensional setting, θ* is often structured, but the naive estimator fails to take such information into account, which can lead to large error. To incorporate prior knowledge on θ*, we design two types of estimators, a constrained one and a regularized one.

Constrained Estimator: If we assume that θ* belongs to some structured set K ⊆ S^(p−1), then the estimation of θ* is carried out via the constrained optimization

θ̂ = argmin_{θ ∈ ℝ^p} −⟨û, θ⟩   s.t. θ ∈ K.   (9)

Here the set K can be non-convex, as long as the optimization can be solved globally. Since the objective function is very simple, we can often obtain a global minimizer. A similar estimator has been used in Plan et al. (2016), but they focused on a specific û.

Regularized Estimator: If we assume that the structure of θ* can be captured by a certain norm ‖·‖, we may alternatively use the regularized estimator

θ̂ = argmin_{θ ∈ ℝ^p} −⟨û, θ⟩ + λ‖θ‖   s.t. ‖θ‖₂ ≤ 1.   (10)

The optimization is convex, so the global minimum is always attained. Previously this estimator was used in the 1-bit CS scenario with the L1 norm (Zhang et al., 2014).

2.3. Recovery Analysis

For the constrained estimator, the recovery of θ* relies on the geometry of K around θ*, which is described by

A_K(θ*) = cone{ v | v = θ − θ*, θ ∈ K } ∩ S^(p−1).   (11)

The set A_K(θ*) essentially contains all possible directions in which the error θ̂ − θ* can lie. The following theorem characterizes the error of θ̂.

Theorem 1 Suppose that the optimization (9) can be solved to a global minimum. Then the following error bound holds for the minimizer θ̂ with probability at least 1 − C exp(−w²(A_K(θ*))),

‖θ̂ − θ*‖₂ ≤ (C′ κ m^(3/2) / β) · ( w(A_K(θ*)) + C″ ) / √n,   (12)

where κ is the sub-Gaussian norm of a standard Gaussian random variable, and C, C′, C″ are absolute constants.

Remark: Note that the estimator is consistent as long as β ≠ 0. The error bound depends inversely on the scale of β, which implies that we should construct suitable q_i such that β is large, according to its definition in Lemma 1. The choice of q_i further depends on the assumed properties of f. Although the dependence on m may prevent us from using higher-order u, m is typically small in practice and can be treated as a constant.

For the regularized estimator, we can similarly establish a recovery guarantee in terms of Gaussian width.

Theorem 2 Define the following set for any ρ > 1,

A_ρ(θ*) = cone{ v | ‖θ* + v‖ ≤ ‖θ*‖ + ‖v‖/ρ }.

If we set λ = ρ‖û − βθ*‖_* = O(ρ m^(3/2) w(B_‖·‖) / √n), where ‖·‖_* denotes the dual norm of ‖·‖, and it satisfies λ < ‖û‖_*, then with probability at least 1 − C exp(−w²(B_‖·‖)), the θ̂ in (10) satisfies

‖θ̂ − θ*‖₂ ≤ (C(1 + ρ) κ m^(3/2) / β) · Ψ(A_ρ(θ*)) · w(B_‖·‖) / √n,   (13)

where Ψ(A_ρ(θ*)) = sup_{v ∈ A_ρ(θ*), v ≠ 0} ‖v‖/‖v‖₂ is the restricted norm compatibility of A_ρ(θ*) and B_‖·‖ = {v | ‖v‖ ≤ 1} is the unit ball of the norm ‖·‖.

Remark: The geometry of the regularized estimator is slightly different from that of the constrained one.
Instead of A_K(θ*), here the set A_ρ(θ*) depends on the choice of the regularization parameter λ. The same phenomenon also appears in Banerjee et al. (2014). The geometric measure Ψ(A_ρ(θ*)) is called the restricted norm compatibility, which is non-random. For many interesting cases, it is easy to compute (Negahban et al., 2012; Chen & Banerjee, 2015b).

2.4. Application to 1-bit CS

For the 1-bit CS problem (2), the u defined in (4) can be chosen with m = 1 and q₁ = y, ending up with

u((x, y)) = y·x   and   û = (1/n) ∑_{i=1}^n y_i x_i.   (14)

With this choice of u, the β defined in Lemma 1 is simply β = E[f(g)·g] with g a standard Gaussian random variable. Under reasonably mild noise, y is likely to take the sign of the linear measurement, which means that f(g) should be close to 1 (or −1) when g is positive (or negative). Thus we expect f(g)·g to be positive most of the time and β to be large.

Given this choice of u, we can specialize our generalized constrained/regularized estimators to obtain previous results. If θ* is assumed to be s-sparse, for the constrained estimator we can choose the straightforward K = {θ | ‖θ‖₀ ≤ s} ∩ S^(p−1), which results in the k-support norm estimator (Chen & Banerjee, 2015a),

θ̂ks = argmin_{θ ∈ ℝ^p} −⟨û, θ⟩   s.t. ‖θ‖₀ ≤ s, ‖θ‖₂ = 1.   (15)

Though K is non-convex, the global minimizer can actually be obtained in closed form,

θ̂ks_j = û_j / ‖|û|_{1:s}‖₂ if |û_j| is among |û|_{1:s}, and 0 otherwise,   (16)

where |û| is the absolute-value counterpart of û with entries sorted in descending order, and the subscript 1:s takes the top s entries. Similarly, if the regularized estimator is instantiated with the L1 norm ‖·‖₁, we obtain the so-called passive algorithm introduced in Zhang et al. (2014),

θ̂ps = argmin_{θ ∈ ℝ^p} −⟨û, θ⟩ + λ‖θ‖₁   s.t. ‖θ‖₂ ≤ 1,   (17)

whose solution is given by θ̂ps = S(û, λ)/‖S(û, λ)‖₂, where S(·,·) is the elementwise soft-thresholding operator, S_i(û, λ) = sign(û_i)·max{|û_i| − λ, 0}. (A code sketch of both closed forms is given at the end of this subsection.) Based on Theorems 1 and 2, we can easily obtain error bounds for both the k-support norm estimator and the passive algorithm.

Corollary 1 Assume that {(x_i, y_i)}_{i=1}^n follow the 1-bit CS model (2) and û is given by (14). For any s-sparse θ*, with high probability, the θ̂ produced by both (15) and (17) (i.e., θ̂ks and θ̂ps) satisfies

‖θ̂ − θ*‖₂ ≤ (C/β) √(s log p / n).   (18)

The proof is included in the supplementary material. The above result was shown by Slawski & Li (2015) and Zhang et al. (2014), but their analyses do not cover general structure. Compared with the O((s log p / n)^(1/4)) rate yielded by the general result in Plan & Vershynin (2013), our bound is much sharper.
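Both closed forms above are straightforward to implement. The following is a minimal sketch (our own illustration); it assumes that û has already been computed, e.g., as the average in (14).

```python
import numpy as np

def constrained_sparse_estimate(u_hat, s):
    """Closed form (16): keep the s largest-magnitude entries of u_hat and renormalize them."""
    theta = np.zeros_like(u_hat)
    top = np.argsort(np.abs(u_hat))[-s:]        # indices of the s largest |u_hat_j|
    theta[top] = u_hat[top]
    return theta / np.linalg.norm(theta)

def passive_estimate(u_hat, lam):
    """Closed-form solution of (17): elementwise soft-thresholding followed by normalization."""
    shrunk = np.sign(u_hat) * np.maximum(np.abs(u_hat) - lam, 0.0)
    nrm = np.linalg.norm(shrunk)
    return shrunk / nrm if nrm > 0 else shrunk  # all-zero if lam exceeds every |u_hat_j|

# Example usage with the u_hat of (14): u_hat = (y[:, None] * X).mean(axis=0)
```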
3. A New Estimator for Monotone Transfer

In this section, we specifically study model (3). Here we further assume that f is strictly increasing. It is worth mentioning that the estimator we develop here can be applied to GLMs as well. To avoid confusion with the u and û defined previously, we use the new notations h and ĥ in this section.

3.1. Estimator with Second-Order ĥ

To motivate the design of h, it is helpful to rewrite model (3) by applying the inverse of f on both sides,

f⁻¹(y) = ⟨θ*, x⟩ + ε.   (19)

Note that the new formulation resembles the linear model, except that we have no access to the value of f⁻¹(y). Instead, all we know about r = [f⁻¹(y₁), ..., f⁻¹(y_n)]ᵀ ∈ ℝ^n is that it preserves the ordering of y = [y₁, ..., y_n]ᵀ. Put another way, r needs to satisfy the constraint that r_i > r_j iff y_i > y_j, and r_i < r_j iff y_i < y_j. Going one step further, this is equivalent to sign(y_i − y_j) = sign(r_i − r_j) = sign(⟨θ*, x_i − x_j⟩ + ε_i − ε_j) by the model assumption. Hence the information contained in the sample {(x_i, y_i)}_{i=1}^n can be interpreted from the perspective of 1-bit CS, where sign(y_i − y_j) reflects the perturbed sign of the linear measurement ⟨θ*, x_i − x_j⟩. Inspired by the u for 1-bit CS, we may choose m = 2 and define h, ĥ as

h((x₁, y₁), (x₂, y₂)) = sign(y₁ − y₂)·(x₁ − x₂),   (20)

ĥ = (1/(n(n − 1))) ∑_{1 ≤ i,j ≤ n, i ≠ j} h((x_i, y_i), (x_j, y_j)).   (21)

Given the definition of ĥ, Lemma 1 directly implies the following corollary.

Corollary 2 Suppose that (x₁, y₁) and (x₂, y₂) are generated by model (3), where x₁, x₂ follow the Gaussian distribution N(0, I), and the noise ε₁, ε₂ are independent of x₁, x₂ and identically (but arbitrarily) distributed. Then the expectation of h((x₁, y₁), (x₂, y₂)) satisfies

E[ h((x₁, y₁), (x₂, y₂)) ] = √2 β′ θ*,   (22)

where β′ = E_{g∼N(0,1)}[ sign(g + (ε₁ − ε₂)/√2) · g ].

Remark: The scalar √2 β′ plays the role of β in Lemma 1, and β′ is always guaranteed to be strictly positive regardless of how the noise is distributed, which keeps θ* identifiable at all times. To see this, let ξ = (ε₁ − ε₂)/√2. Note that ξ is symmetric, so εξ has the same distribution as ξ, where ε is a Rademacher random variable. Therefore

β′ = E[sign(g + ξ)·g] = E_{g,ξ} E_ε[sign(g + εξ)·g] = E_{g,ξ}[ (sign(g − ξ) + sign(g + ξ))·g / 2 ].

Since g(g − ξ) + g(g + ξ) = 2g² ≥ 0, it follows that sign(g(g − ξ)) + sign(g(g + ξ)) = (sign(g − ξ) + sign(g + ξ))·sign(g) ≥ 0, and thus (sign(g − ξ) + sign(g + ξ))·g is always nonnegative. Pick M > 0 large enough that P(|ξ| ≤ M) ≥ 0.5, and we have

β′ = E[sign(g + ξ)·g] ≥ E_ξ E_g[ |g|·I{|g| > |ξ|} ] ≥ 0.5·E_g[ |g|·I{|g| > M} ] ≥ (M/2)·P(|g| > M) > 0.

In the ideal noiseless case, β′ achieves its maximum, β′_max = E[sign(g)·g] = E[|g|] = √(2/π). In the worst case, where ε₁ and ε₂ are heavy-tailed and dominate g, β′ can be close to zero, yet it remains strictly positive for any fixed noise distribution.

Now we can instantiate the generalized estimator based on ĥ. For example, if θ* is s-sparse, we estimate it by

θ̂ = argmin_{θ ∈ ℝ^p} −⟨ĥ, θ⟩   s.t. ‖θ‖₀ ≤ s, ‖θ‖₂ = 1,   (23)

which enjoys an O(√(s log p / n)) error rate as shown in Corollary 1. The regularized estimator can also be obtained with the same ĥ, as in (17). The bottleneck in computing θ̂ lies in the calculation of ĥ. The simple proposition below enables us to compute ĥ quickly; a code sketch follows the remark.

Proposition 1 Given {(x_i, y_i)}_{i=1}^n, let π be the permutation of {1, ..., n} such that y_{π₁} > y_{π₂} > ... > y_{π_n}. Then we have

ĥ = (2/(n(n − 1))) ∑_{i=1}^n (n + 1 − 2i)·x_{π_i}.   (24)

Remark: Based on the proposition above, ĥ can be computed in O(np + n log n) time: O(n log n) for sorting y and O(np) for the weighted sum of all x_i. This is a significant improvement over the naive calculation using (21), which takes O(n²p) time.
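A minimal sketch of this computation (our own illustration; like Proposition 1, it assumes there are no ties among the y_i), together with the naive O(n²p) evaluation of (21) as a correctness check:

```python
import numpy as np

def h_hat_fast(X, y):
    """Compute h_hat via equation (24): sort y once, then take a weighted sum of the rows of X."""
    n = len(y)
    order = np.argsort(-y)                     # permutation pi with y[order[0]] > y[order[1]] > ...
    weights = n + 1 - 2 * np.arange(1, n + 1)  # n + 1 - 2i for i = 1, ..., n
    return 2.0 / (n * (n - 1)) * (weights @ X[order])

def h_hat_naive(X, y):
    """Direct evaluation of (21); O(n^2 p), used here only to sanity-check the fast version."""
    n = len(y)
    total = np.zeros(X.shape[1])
    for i in range(n):
        for j in range(n):
            if i != j:
                total += np.sign(y[i] - y[j]) * (X[i] - X[j])
    return total / (n * (n - 1))

rng = np.random.default_rng(2)
X, y = rng.standard_normal((40, 6)), rng.standard_normal(40)  # continuous y, so ties occur with probability 0
assert np.allclose(h_hat_fast(X, y), h_hat_naive(X, y))
```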
3.2. Beyond Unstructured Sparsity

So far we have illustrated the Gaussian-width-based error bounds, viz. (12) and (13), only through unstructured sparsity of θ*. Here we provide two more examples, non-overlapping group sparsity and fused sparsity.

Non-Overlapping Group Sparsity: Suppose the coordinates of θ* have been partitioned into K predefined disjoint groups G₁, ..., G_K ⊆ {1, 2, ..., p}, out of which only k groups are non-zero. If we use the regularized estimator with the L2,1 norm ‖θ‖_{2,1} = ∑_{i=1}^K ‖θ_{G_i}‖₂, the optimal solution can be obtained similarly to (17), with elementwise soft-thresholding replaced by groupwise soft-thresholding. The related geometric measures appearing in (13) can be found in Banerjee et al. (2014), which give

Ψ(A_ρ(θ*)) ≤ O(√k)   (25)   and   w(B_{‖·‖_{2,1}}) ≤ O(√(max_i |G_i| + log K)).   (26)

Fused Sparsity: θ* is said to be s-fused-sparse if the cardinality of the set F(θ*) = {1 ≤ i < p | θ*_i ≠ θ*_{i+1}} is smaller than s. If we resort to the constrained estimator (9) with K = {θ | |F(θ)| ≤ s, ‖θ‖₂ = 1}, the associated optimization can be solved by dynamic programming (Bellman, 1961). The proposition below upper-bounds the corresponding Gaussian width w(A_K(θ*)) in (12).

Proposition 2 For s-fused-sparse θ*, the Gaussian width of the set A_K(θ*) with K = {θ | |F(θ)| ≤ s, ‖θ‖₂ = 1} satisfies

w(A_K(θ*)) ≤ O(√(s log p)).   (27)

The proof can be found in Slawski & Li (2016); we provide a different one in the supplementary material.

4. Lemmas and Proof Sketch of Theorem 1

Here we first present the key technical lemmas used in the proof of Theorem 1. The first is a Hoeffding-type inequality for sub-Gaussian U-statistics. In the literature, most studies center on bounded U-statistics, for which the celebrated concentration inequality was established by Hoeffding (1963). It is not easy to locate a counterpart for the sub-Gaussian case, so we provide the following result and attach a proof in the supplementary material.

Lemma 2 (Concentration for sub-Gaussian U-statistics) Define the U-statistic

U_{n,m}(h) = ((n − m)!/n!) ∑_{1 ≤ i₁,...,i_m ≤ n, i₁ ≠ i₂ ≠ ... ≠ i_m} h(z_{i₁}, ..., z_{i_m})   (28)

with order m and kernel h : (ℝ^d)^m → ℝ, based on n independent copies of a random vector z ∈ ℝ^d, denoted by z₁, ..., z_n. If h(·, ..., ·) is sub-Gaussian with ‖h‖_{ψ₂} ≤ κ, then the following inequality holds for U_{n,m}(h) for any δ > 0,

P( |U_{n,m}(h) − E U_{n,m}(h)| > δ ) ≤ 2 exp( −C ⌊n/m⌋ δ² / κ² ),   (29)

in which C is an absolute constant.

As mentioned in Section 1, generic chaining is the key tool on which our analysis relies. Specifically, we utilize Theorem 2.2.27 from Talagrand (2014).

Lemma 3 (Generic chaining concentration) Given a metric space (T, s), if an associated stochastic process {Z_t}_{t∈T} has sub-Gaussian increments, i.e., satisfies the condition

P( |Z_u − Z_v| ≥ δ ) ≤ C exp( −C′ δ² / s²(u, v) )   for all u, v ∈ T,   (30)

then the following inequality holds,

P( sup_{u,v ∈ T} |Z_u − Z_v| ≥ C₁ ( γ₂(T, s) + δ·diam(T, s) ) ) ≤ C₂ exp(−δ²),   (31)

where C, C′, C₁ and C₂ are absolute constants.

The definition of the γ₂-functional γ₂(·,·) above is involved and not of great importance here; we refer the interested reader to the books Talagrand (2005; 2014). Loosely speaking, γ₂(T, s) can be thought of as a measure of the size of the set T under the metric s. What really matters is the following relationship between the γ₂-functional and the Gaussian width (see Theorem 2.4.1 in Talagrand (2014)).

Lemma 4 (Majorizing measures theorem) For any set T ⊆ ℝ^p, the γ₂-functional w.r.t. the L2 metric and the Gaussian width satisfy the following inequality with an absolute constant C₀,

γ₂(T, ‖·‖₂) ≤ C₀·w(T).   (32)

Equipped with these lemmas, we are ready to present the proof sketch of Theorem 1. A complete proof is deferred to the supplementary material.

Proof Sketch of Theorem 1: We use the shorthand A_K for the set A_K(θ*). Since θ̂ attains the global minimum of (9), we have ⟨θ̂ − θ*, û⟩ ≥ 0, which implies

1 − ⟨θ̂, θ*⟩ ≤ ⟨θ̂ − θ*, û/β − θ*⟩ ≤ ‖θ̂ − θ*‖₂ · sup_{v ∈ A_K ∪ {0}} |⟨v, û/β − θ*⟩|.

In order to bound the supremum above, we use the result from generic chaining. We define the stochastic process {Z_v = ⟨v, û/β − θ*⟩}_{v ∈ A_K ∪ {0}}. First, we need to check that the process has sub-Gaussian increments. For simplicity, we denote u((x_{i₁}, y_{i₁}), ..., (x_{i_m}, y_{i_m})) by u_{i₁,...,i_m}.
By the definitions and properties of the sub-Gaussian norm (Vershynin, 2012), it is not difficult to show that ‖⟨u_{i₁,...,i_m}, v − w⟩‖_{ψ₂} ≤ κ m ‖v − w‖₂ for any v, w ∈ A_K ∪ {0}. By Lemma 2, we have

P( |Z_v − Z_w| > δ ) ≤ 2 exp( −C n β² δ² / (m³ κ² ‖v − w‖₂²) ).

Therefore {Z_v} has sub-Gaussian increments w.r.t. the metric s(v, w) = κ m^(3/2) ‖v − w‖₂ / (β√n). Now applying Lemma 3 to {Z_v}, with a bit of calculation we obtain

P( sup_{v ∈ A_K ∪ {0}} |Z_v| ≥ (C₁ κ m^(3/2) / (β√n)) · ( γ₂(A_K ∪ {0}, ‖·‖₂) + 2δ ) ) ≤ C₂ exp(−δ²).

Using γ₂(A_K ∪ {0}, ‖·‖₂) ≤ C₀ w(A_K ∪ {0}) implied by Lemma 4 and taking δ = w(A_K ∪ {0}), we get

sup_{v ∈ A_K ∪ {0}} |⟨v, û/β − θ*⟩| ≤ C₃ κ m^(3/2) ( w(A_K) + C₄ ) / (β√n)

with probability at least 1 − C₂ exp(−w²(A_K)). The inequality also uses the fact that w(A_K ∪ {0}) ≤ w(A_K) + C₄, which is a consequence of Lemma 2 in Maurer et al. (2014) (see Lemma A in the supplementary material). Lastly, we turn to the quantity ‖θ̂ − θ*‖₂:

‖θ̂ − θ*‖₂² = 2 − 2⟨θ̂, θ*⟩ ≤ 2‖θ̂ − θ*‖₂ · C₃ κ m^(3/2) ( w(A_K) + C₄ ) / (β√n),

hence ‖θ̂ − θ*‖₂ ≤ 2C₃ κ m^(3/2) ( w(A_K) + C₄ ) / (β√n). We finish the proof by letting C′ = 2C₃, C″ = C₄ and C = C₂.

5. Experiments

In the experiments, we focus on model (3) with sparse θ*. The problem dimension is fixed at p = 1000, and the sparsity of θ* is set to 10. We generate data (x, y) from y = f(⟨θ*, x⟩ + ε), where x ~ N(0, I) and ε ~ N(0, σ²), with σ ranging from 0.3 to 1.5. We choose three monotonically increasing f: f(z) = 1/(1 + exp(−z)) (bounded and Lipschitz), f(z) = z³ (unbounded and non-Lipschitz), and f(z) = log(1 + exp(z)) (unbounded but Lipschitz). The sample size n varies from 200 to 1000. We use the estimator (23) from Section 3. The baselines we compare against are the SILO and iSILO algorithms introduced in Ganti et al. (2015). SILO does not really take the monotonicity into account; in fact, it is the special case of our generalized constrained estimator that uses the same choice of u as 1-bit CS. The original SILO uses the constraint set {θ | ‖θ‖₁ ≤ √s, ‖θ‖₂ ≤ 1}, which is computationally less efficient and statistically no better than K = {θ | ‖θ‖₀ ≤ s} ∩ S^(p−1) (Zhang et al., 2014; Chen & Banerjee, 2015a), so we also use K in SILO for a fair comparison. iSILO relies on a specific implementation of isotonic regression that explicitly restricts the Lipschitz constant of f to be one. To fit iSILO into our setting, we remove the Lipschitzness constraint and perform standard isotonic regression. Since convergence is not guaranteed for the iterative procedure of iSILO, the number of its iterations is fixed at 100, and its best tuning parameter is obtained by grid search.

The experimental results are shown in Figure 1. Overall, the iSILO algorithm works well under small noise, while our estimator performs better as the variance of the noise increases. To better demonstrate the robustness of our estimator to heavy-tailed noise, we also sample ε from the Student's t distribution with 3 degrees of freedom instead of Gaussian noise. We repeat the experiments for f(z) = z³ and obtain the plots in Figure 2. We can see that the error of our estimator consistently decreases for all choices of σ as n increases. For SILO and iSILO, the errors are relatively large, and fail to shrink for large σ even when more data are provided.
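For reference, the simulation just described can be reproduced along the following lines. This is a condensed sketch of our own: the SILO/iSILO baselines, the full grid over σ and n, and the plotting are omitted, and the helpers h_hat_fast and constrained_sparse_estimate are the ones sketched in Sections 3.1 and 2.4, respectively.

```python
import numpy as np

rng = np.random.default_rng(3)
p, s, n, sigma = 1000, 10, 600, 0.9

# s-sparse, unit-norm ground truth.
theta_star = np.zeros(p)
theta_star[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
theta_star /= np.linalg.norm(theta_star)

# Data from model (3) with f(z) = z^3; use Student's t noise for the heavy-tailed setting.
X = rng.standard_normal((n, p))
eps = sigma * rng.standard_normal(n)           # or: sigma * rng.standard_t(df=3, size=n)
y = (X @ theta_star + eps) ** 3

h = h_hat_fast(X, y)                           # Proposition 1
theta_hat = constrained_sparse_estimate(h, s)  # estimator (23), via the closed form (16) applied to h_hat
print("recovery error:", np.linalg.norm(theta_hat - theta_star))
```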
[Figure 1: Recovery error vs. sample size n (200 to 1000) for σ ∈ {0.3, 0.6, 0.9, 1.2, 1.5}, comparing our estimator, SILO, and iSILO. Panels: (a) f(z) = 1/(1 + exp(−z)), (b) f(z) = log(1 + exp(z)), (c) f(z) = z³.]

Figure 1. Recovery error vs. sample size. (a) Our estimator has performance similar to iSILO, and both outperform SILO by a large margin. (b) iSILO has smaller error when σ is small, while our estimator works better in the high-noise regime. (c) The error of SILO is reduced compared with the other choices of f, but iSILO fails to give further improvement over SILO when σ is large; our estimator still outperforms both when σ ≥ 0.6.

[Figure 2: Recovery error vs. sample size n (200 to 1000) for σ ∈ {0.3, 0.6, 0.9, 1.2, 1.5}, comparing our estimator, SILO, and iSILO under Student's t noise.]

Figure 2. Recovery error vs. sample size, with f(z) = z³ under heavy-tailed noise.

6. Conclusion

In this paper, we study parameter estimation for high-dimensional single-index models. We propose two classes of robust estimators, which generalize previous work in two respects. First, we allow diverse properties of the transfer function (e.g., binary output, monotonicity), which helps us customize the estimators. Second, the structure of the true parameter can be general, encoded either by a constraint or by a norm. With limited assumptions on the noise, we show that the estimation error can be bounded by simple geometric measures under Gaussian measurements, which subsumes existing results for specific settings. The experimental results also validate our theoretical analyses.

Acknowledgements

We thank Sreangsu Acharyya for helpful discussions related to the paper. The research was supported by NSF grants IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, NASA grant NNX12AQ39A, and gifts from Adobe, IBM, and Yahoo.

References

Alquier, P. and Biau, G. Sparse single-index model. Journal of Machine Learning Research, 14:243-280, 2013.

Amelunxen, D., Lotz, M., McCoy, M. B., and Tropp, J. A. Living on the edge: Phase transitions in convex programs with random data. Information and Inference, 3(3):224-294, 2014.

Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012.

Banerjee, A., Chen, S., Fazayeli, F., and Sivakumar, V. Estimation with norm regularization. In Advances in Neural Information Processing Systems (NIPS), 2014.

Bellman, R. On the approximation of curves by line segments using dynamic programming. Communications of the ACM, 4(6):284, 1961.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.

Boufounos, P. T. and Baraniuk, R. G. 1-bit compressive sensing. In 42nd Annual Conference on Information Sciences and Systems (CISS), 2008.
Candes, E. and Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313-2351, 2007.

Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805-849, 2012.

Chatterjee, S., Chen, S., and Banerjee, A. Generalized Dantzig selector: Application to the k-support norm. In Advances in Neural Information Processing Systems (NIPS), 2014.

Chen, S. and Banerjee, A. One-bit compressed sensing with the k-support norm. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015a.

Chen, S. and Banerjee, A. Structured estimation with atomic norms: General bounds and applications. In Advances in Neural Information Processing Systems (NIPS), 2015b.

Chen, S. and Banerjee, A. Structured matrix recovery via the generalized Dantzig selector. In Advances in Neural Information Processing Systems (NIPS), 2016.

Dirksen, S. Dimensionality reduction with subgaussian matrices: A unified theory. Foundations of Computational Mathematics, 16(5):1367-1396, 2016.

Ganti, R., Rao, N., Willett, R. M., and Nowak, R. Learning single index models in high dimensions. arXiv preprint arXiv:1506.08910, 2015.

Gopi, S., Netrapalli, P., Jain, P., and Nori, A. One-bit compressed sensing: Provable support and vector recovery. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Gordon, Y. Some inequalities for Gaussian processes and applications. Israel Journal of Mathematics, 50(4):265-289, 1985.

Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

Horowitz, J. L. and Hardle, W. Direct semiparametric estimation of single-index models with discrete covariates. Journal of the American Statistical Association, 91(436):1632-1640, 1996.

Ichimura, H. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58:71-120, 1993.

Jacques, L., Laska, J. N., Boufounos, P. T., and Baraniuk, R. G. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Transactions on Information Theory, 59(4):2082-2102, 2013.

Kakade, S., Shamir, O., Sridharan, K., and Tewari, A. Learning exponential families in high-dimensions: Strong convexity and sparsity. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

Kakade, S. M., Kanade, V., Shamir, O., and Kalai, A. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems (NIPS), 2011.

Kalai, A. T. and Sastry, R. The Isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.

Koltchinskii, V. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Mathematics. Springer Berlin Heidelberg, 2011.

Lee, A. J. U-Statistics: Theory and Practice. Taylor & Francis, 1990.

Li, P. One scan 1-bit compressed sensing. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Maurer, A., Pontil, M., and Romera-Paredes, B. An inequality with applications to structured sparsity and multitask dictionary learning. In Conference on Learning Theory (COLT), 2014.

McCullagh, P. Generalized linear models. European Journal of Operational Research, 16(3):285-292, 1984.
Negahban, S., Ravikumar, P., Wainwright, M. J., and Yu, B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538-557, 2012.

Neykov, M., Liu, J. S., and Cai, T. L1-regularized least squares for support recovery of high dimensional single index models with Gaussian designs. Journal of Machine Learning Research, 17(1):2976-3012, 2016.

Oymak, S. and Soltanolkotabi, M. Fast and reliable parameter estimation from nonlinear observations. arXiv preprint arXiv:1610.07108, 2016.

Oymak, S., Thrampoulidis, C., and Hassibi, B. The squared-error of generalized LASSO: A precise analysis. In 51st Annual Allerton Conference on Communication, Control, and Computing, 2013.

Plan, Y. and Vershynin, R. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482-494, 2013.

Plan, Y., Vershynin, R., and Yudovina, E. High-dimensional estimation with geometric constraints. Information and Inference, 2016.

Radchenko, P. High dimensional single index models. Journal of Multivariate Analysis, 139:266-282, 2015.

Rao, N., Recht, B., and Nowak, R. Universal measurement bounds for structured sparse signal recovery. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Slawski, M. and Li, P. b-bit marginal regression. In Advances in Neural Information Processing Systems (NIPS), 2015.

Slawski, M. and Li, P. Linear signal recovery from b-bit-quantized linear measurements: precise analysis of the trade-off between bit depth and number of measurements. arXiv preprint arXiv:1607.02649, 2016.

Talagrand, M. The Generic Chaining. Springer, 2005.

Talagrand, M. Upper and Lower Bounds for Stochastic Processes. Springer, 2014.

Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

Tropp, J. A. Convex recovery of a structured signal from independent random linear measurements. In Sampling Theory, a Renaissance. 2015.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. In Eldar, Y. and Kutyniok, G. (eds.), Compressed Sensing, chapter 5, pp. 210-268. Cambridge University Press, 2012.

Vershynin, R. Estimation in High Dimensions: A Geometric Perspective, pp. 3-66. Springer International Publishing, 2015.

Wainwright, M. J. Sharp thresholds for high-dimensional and noisy recovery of sparsity using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183-2202, 2009.

Yang, Z., Wang, Z., Liu, H., Eldar, Y. C., and Zhang, T. Sparse nonlinear regression: Parameter estimation under nonconvexity. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

Yi, X., Wang, Z., Caramanis, C., and Liu, H. Optimal linear estimation under unknown nonlinear transform. In Advances in Neural Information Processing Systems (NIPS), 2015.

Zhang, L., Yi, J., and Jin, R. Efficient algorithms for robust one-bit compressive sensing. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

Zhu, R. and Gu, Q. Towards a lower sample complexity for robust one-bit compressed sensing. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.