# Structure-Blind Signal Recovery

Dmitry Ostrovsky¹, Zaid Harchaoui², Anatoli Juditsky¹, Arkadi Nemirovski³
firstname.lastname@imag.fr

¹ LJK, University of Grenoble Alpes, 700 Avenue Centrale, 38401 Domaine Universitaire de Saint-Martin-d'Hères, France.
² University of Washington, Seattle, WA 98195, USA.
³ Georgia Institute of Technology, Atlanta, GA 30332, USA.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We consider the problem of recovering a signal observed in Gaussian noise. If the set of signals is convex and compact, and can be specified beforehand, one can use classical linear estimators that achieve a risk within a constant factor of the minimax risk. However, when the set is unspecified, designing an estimator that is blind to the hidden structure of the signal remains a challenging problem. We propose a new family of estimators to recover signals observed in Gaussian noise. Instead of specifying the set where the signal lives, we assume the existence of a well-performing linear estimator. The proposed estimators enjoy exact oracle inequalities and can be efficiently computed through convex optimization. We present several numerical illustrations that show the potential of the approach.

## 1 Introduction

We consider the problem of recovering a complex-valued signal $(x_t)_{t\in\mathbb{Z}}$ from the noisy observations

$$y_\tau = x_\tau + \sigma\zeta_\tau, \qquad -n \le \tau \le n. \tag{1}$$

Here $n\in\mathbb{Z}_+$, and $\zeta_\tau\sim\mathcal{CN}(0,1)$ are i.i.d. standard complex-valued Gaussian random variables, meaning that $\zeta_0=\xi^1_0+\imath\xi^2_0$ with i.i.d. $\xi^1_0,\xi^2_0\sim\mathcal{N}(0,1)$. Our goal is to recover $x_t$, $0\le t\le n$, given the sequence of observations $y_{t-n},\dots,y_t$ up to instant $t$, a task usually referred to as (pointwise) filtering in machine learning, statistics, and signal processing [5].

The traditional approach to this problem considers linear estimators, or linear filters, which write as

$$\widehat x_t = \sum_{\tau=0}^{n} \varphi_\tau\, y_{t-\tau}, \qquad 0\le t\le n.$$

Linear estimators have been thoroughly studied in various forms; they are both theoretically attractive [7, 3, 2, 16, 17, 11, 13] and easy to use in practice. If the set $X$ of signals is well-specified, one can usually compute a (nearly) minimax on $X$ linear estimator in closed form. In particular, if $X$ is a class of smooth signals, such as a Hölder or a Sobolev ball, then the corresponding estimator is given by the kernel estimator with a properly set bandwidth parameter [16] and is minimax among all possible estimators. Moreover, as shown by [6, 2], if $X$ is only assumed convex, compact, and centrally symmetric, the risk of the best linear estimator of $x_t$ is within a small constant factor of the minimax risk over $X$. Besides, if the set $X$ can be specified in a computationally tractable way, which is clearly a weaker assumption than the classical smoothness assumptions, the best linear estimator can be efficiently computed by solving a convex optimization problem on $X$. In other words, given a computationally tractable set $X$ on input, one can compute a nearly-minimax linear estimator and the corresponding (nearly-minimax) risk over $X$. The strength of this approach, however, comes at a price: the set $X$ still must be specified beforehand. Therefore, when one faces a recovery problem without any prior knowledge of $X$, this approach cannot be implemented.

We adopt here a novel approach to filtering, which we refer to as structure-blind recovery. While we do not require $X$ to be specified beforehand, we assume that there exists a linear oracle, that is, a well-performing linear estimator of $x_t$.
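To make the filtering setup concrete, here is a minimal sketch (our illustration, not code from the paper) of such a causal linear estimate $\widehat x_t=\sum_\tau \varphi_\tau y_{t-\tau}$ applied to noisy complex-valued observations as in (1); the 3-tap moving-average kernel is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_filter_estimate(y, phi):
    """Causal FIR estimate xhat[t] = sum_{tau=0}^{m} phi[tau] * y[t - tau].
    Samples before y[0] are treated as zero; to reproduce the setting of (1),
    prepend the observed history y_{-n}, ..., y_{-1} to y before calling."""
    return np.convolve(y, phi)[: len(y)]

# Toy example: a 3-tap moving average applied to a noisy constant signal.
n, sigma = 64, 0.3
x = np.ones(n, dtype=complex)
zeta = rng.standard_normal(n) + 1j * rng.standard_normal(n)  # complex noise as in (1)
y = x + sigma * zeta
xhat = linear_filter_estimate(y, np.ones(3) / 3.0)
```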
Previous works [8, 10, 4], following a similar philosophy, proved that one can efficiently adapt to the linear oracle filter of length $m=O(n)$ if the corresponding filter $\phi$ is time-invariant, i.e., it recovers the target signal uniformly well in an $O(n)$-sized neighbourhood of $t$, and if its $\ell_2$-norm is small, bounded by $\rho/\sqrt{m}$ for a moderate $\rho\ge 1$. The adaptive estimator is computed by minimizing the $\ell_\infty$-norm of the filter discrepancy in the Fourier domain, under a constraint on the $\ell_1$-norm of the filter in the Fourier domain. Compared to the oracle linear filter, the price for adaptation is proved to be $O(\rho^3\sqrt{\ln n})$, with a lower bound of $O(\rho\sqrt{\ln n})$ [8, 4].

We make the following contributions:

- we propose a new family of recovery methods, obtained by solving a least-squares problem constrained or penalized by the $\ell_1$-norm of the filter in the Fourier domain;
- we prove exact oracle inequalities for the $\ell_2$-risk of these methods;
- we show that the price for adaptation improves upon previous works [8, 4] to $O(\rho^2\sqrt{\ln n})$ for the pointwise risk and to $O(\rho\sqrt{\ln n})$ for the $\ell_2$-risk;
- we present numerical experiments that show the potential of the approach on synthetic and real-world images and signals.

Before presenting the theoretical results, let us introduce the notation we use throughout the paper.

**Filters** Let $\mathbb{C}(\mathbb{Z})$ be the linear space of all two-sided complex-valued sequences $x=\{x_t\in\mathbb{C}\}_{t\in\mathbb{Z}}$. For $k,k'\in\mathbb{Z}$, $k\le k'$, we consider the finite-dimensional subspaces

$$\mathbb{C}(\mathbb{Z}_k^{k'}) = \{x\in\mathbb{C}(\mathbb{Z}):\ x_t=0,\ t\notin[k,k']\}.$$

It is convenient to identify $m$-dimensional complex vectors, $m=k'-k+1$, with elements of $\mathbb{C}(\mathbb{Z}_k^{k'})$ by means of the notation $x_k^{k'} := [x_k;\dots;x_{k'}]\in\mathbb{C}^{k'-k+1}$. We associate to linear mappings $\mathbb{C}(\mathbb{Z}_k^{k'})\to\mathbb{C}(\mathbb{Z}_j^{j'})$ the $(j'-j+1)\times(k'-k+1)$ matrices with complex entries. The convolution $u*v$ of two sequences $u,v\in\mathbb{C}(\mathbb{Z})$ is the sequence with elements $[u*v]_t=\sum_{\tau\in\mathbb{Z}} u_\tau v_{t-\tau}$, $t\in\mathbb{Z}$. Given observations (1) and $\varphi\in\mathbb{C}(\mathbb{Z}_0^m)$, consider the (left) linear estimate of $x$ associated with the filter $\varphi$:

$$\widehat x_t = [\varphi * y]_t$$

($\widehat x_t$ is merely a kernel estimate of $x_t$ by a kernel $\varphi$ supported on $\{0,\dots,m\}$).

**Discrete Fourier transform** We define the unitary Discrete Fourier transform (DFT) operator $F_n:\mathbb{C}^{n+1}\to\mathbb{C}^{n+1}$ by

$$z\mapsto F_n z,\qquad [F_n z]_k = (n+1)^{-1/2}\sum_{t=0}^{n} z_t\, e^{-\frac{2\pi\imath kt}{n+1}},\quad 0\le k\le n.$$

The inverse Discrete Fourier transform (iDFT) operator $F_n^{-1}$ is given by $F_n^{-1}:=F_n^{\mathsf H}$ (here $A^{\mathsf H}$ stands for the Hermitian adjoint of $A$). By the Fourier inversion theorem, $F_n^{-1}(F_n z)=z$.

We denote by $\|\cdot\|_p$ the usual $\ell_p$-norms on $\mathbb{C}(\mathbb{Z})$: $\|x\|_p=(\sum_{t\in\mathbb{Z}}|x_t|^p)^{1/p}$, $p\in[1,\infty]$. Usually, the argument will be finite-dimensional, an element of $\mathbb{C}(\mathbb{Z}_k^{k'})$; we reserve the special notation $\|x\|_{n,p}:=\|x_0^n\|_p$. Furthermore, the DFT allows us to equip $\mathbb{C}(\mathbb{Z}_0^n)$ with norms associated with $\ell_p$-norms in the spectral domain:

$$\|x\|^*_{n,p} := \|F_n x_0^n\|_p,\qquad p\in[1,\infty];$$

note that unitarity of the DFT implies the Parseval identity $\|x\|_{n,2}=\|x\|^*_{n,2}$. Finally, $c$, $C$, and $C'$ stand for generic absolute constants.

## 2 Oracle inequality for constrained recovery

Given observations (1) and $\varrho>0$, we first consider the constrained recovery $\widehat x^{\mathrm{con}}$ given by

$$[\widehat x^{\mathrm{con}}]_t = [\widehat\varphi * y]_t,\qquad t=0,\dots,n,$$

where $\widehat\varphi$ is an optimal solution of the constrained optimization problem

$$\min_{\varphi\in\mathbb{C}(\mathbb{Z}_0^n)}\left\{\|y-\varphi*y\|_{n,2}:\ \|\varphi\|^*_{n,1}\le \varrho/\sqrt{n+1}\right\}.\tag{2}$$

The constrained recovery estimator minimizes a least-squares fit criterion under a constraint on $\|\varphi\|^*_{n,1}=\|F_n\varphi_0^n\|_1$, that is, an $\ell_1$ constraint on the discrete Fourier transform of the filter.
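Before motivating the constraint, here is a minimal numerical sketch of problem (2) using `cvxpy` and the unitary DFT matrix; the helper name and the Toeplitz construction are ours, not taken from the paper, and the authors discuss their own implementation in Sec. 6.

```python
import numpy as np
import scipy.linalg
import cvxpy as cp

def constrained_recovery(y_full, n, rho):
    """Sketch of problem (2). `y_full` holds the observations y_{-n}, ..., y_n
    (length 2n+1), so that [phi*y]_t is well defined for t = 0, ..., n.
    Returns the filter phi and the recovered signal phi*y on [0, n]."""
    # Toeplitz matrix with T[t, tau] = y_{t-tau}, so that T @ phi = phi*y on [0, n].
    T = scipy.linalg.toeplitz(y_full[n:], y_full[n::-1])
    F = scipy.linalg.dft(n + 1, scale="sqrtn")  # unitary DFT: F @ phi = F_n phi
    phi = cp.Variable(n + 1, complex=True)
    problem = cp.Problem(
        cp.Minimize(cp.norm(y_full[n:] - T @ phi, 2)),
        [cp.norm(F @ phi, 1) <= rho / np.sqrt(n + 1)],
    )
    problem.solve()
    return phi.value, T @ phi.value
```

For simple signals in the sense of the next paragraph, the analysis below suggests the choice $\varrho = 2\rho^2$.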
While the least-squares objective naturally follows from the Gaussian noise assumption, the constraint can be motivated as follows.

**Small-error linear filters** A linear filter $\varphi^o$ with a small $\ell_1$ norm in the spectral domain and small recovery error exists, essentially, whenever there exists a linear filter with small recovery error [8, 4]. Indeed, let us say that $x\in\mathbb{C}(\mathbb{Z}_0^n)$ is *simple* [4] with parameters $m\in\mathbb{Z}_+$ and $\rho\ge 1$ if there exists $\phi^o\in\mathbb{C}(\mathbb{Z}_0^m)$ such that for all $-m\le\tau\le 2m$,

$$\left(\mathbb{E}\,\big|x_\tau-[\phi^o*y]_\tau\big|^2\right)^{1/2}\ \le\ \frac{\sigma\rho}{\sqrt{m+1}}.\tag{3}$$

In other words, $x$ is $(m,\rho)$-simple if there exists a hypothetical filter $\phi^o$ of length at most $m+1$ which recovers $x_\tau$ with squared risk uniformly bounded by $\frac{\sigma^2\rho^2}{m+1}$ in the interval $-m\le\tau\le 2m$. Note that (3) clearly implies that $\|\phi^o\|_2\le\rho/\sqrt{m+1}$, and that $\big|[x-\phi^o*x]_\tau\big|\le\sigma\rho/\sqrt{m+1}$ for all $\tau$, $-m\le\tau\le 2m$. Now, let $n=2m$, and let $\varphi^o=\phi^o*\phi^o\in\mathbb{C}^{n+1}$. As proved in [15, Appendix C], we have

$$\|\varphi^o\|^*_{n,1}\ \le\ \frac{2\rho^2}{\sqrt{n+1}}\tag{4}$$

and, for a moderate absolute constant $c$,

$$\|x-\varphi^o*y\|_{n,2}\ \le\ c\sigma\rho^2\sqrt{1+\ln[1/\alpha]}\tag{5}$$

with probability $\ge 1-\alpha$. To summarize, if $x$ is $(m,\rho)$-simple, i.e., when there exists a filter $\phi^o$ of length at most $m+1$ which recovers $x$ with small risk on the interval $[-m,2m]$, then the filter $\varphi^o=\phi^o*\phi^o$ of length at most $n+1$, with $n=2m$, has small norm $\|\varphi^o\|^*_{n,1}$ and recovers the signal $x$ with (essentially the same) small risk on the interval $[0,n]$.

**Hidden structure** The constrained recovery estimator is completely blind to a possible hidden structure of the signal, yet can seamlessly adapt to such a structure when it exists, in a way that we can rigorously establish. Using the right-shift operator $\Delta$ on $\mathbb{C}(\mathbb{Z})$, $[\Delta x]_t = x_{t-1}$, we formalize the hidden structure as an unknown shift-invariant linear subspace $S$ of $\mathbb{C}(\mathbb{Z})$, $\Delta S = S$, of small dimension $s$. We do not assume that $x$ belongs to that subspace. Instead, we make the more general assumption that $x$ is *close* to it: it may be decomposed into a sum of a component that lies in the subspace and a component whose norm we can control.

**Assumption A** We suppose that $x$ admits the decomposition

$$x = x^S + \varepsilon,\qquad x^S\in S,$$

where $S$ is an (unknown) shift-invariant, $\Delta S = S$, subspace of $\mathbb{C}(\mathbb{Z})$ of dimension $s$, $1\le s\le n+1$, and $\varepsilon$ is "small", namely,

$$\|\Delta^\tau\varepsilon\|_{n,2}\ \le\ \sigma\kappa,\qquad 0\le\tau\le n.$$

Shift-invariant subspaces of $\mathbb{C}(\mathbb{Z})$ are exactly the sets of solutions of homogeneous linear difference equations with polynomial operators. This is summarized by the following lemma (we believe it is a known fact; for completeness we provide a proof in [15, Appendix C]).

**Lemma 2.1.** *The solution set of a homogeneous difference equation with a polynomial operator $p(\Delta)$,*

$$[p(\Delta)x]_t = 0,\qquad t\in\mathbb{Z},\tag{6}$$

*with $\deg(p(\Delta))=s$ and $p(0)=1$, is a shift-invariant subspace of $\mathbb{C}(\mathbb{Z})$ of dimension $s$. Conversely, any shift-invariant subspace $S\subseteq\mathbb{C}(\mathbb{Z})$, $\Delta S = S$, $\dim(S)=s<\infty$, is the set of solutions of some homogeneous difference equation (6) with $\deg(p(\Delta))=s$, $p(0)=1$. Moreover, such $p(\Delta)$ is unique.*

On the other hand, for any polynomial $p(\Delta)$, solutions of (6) are exponential polynomials with frequencies determined by the roots of $p(\Delta)$. For instance, discrete-time polynomials $x_t=\sum_{k=0}^{s-1}c_k t^k$, $t\in\mathbb{Z}$, of degree $s-1$ (that is, exponential polynomials with all zero frequencies) form a linear space of dimension $s$ of solutions of equation (6) with the polynomial $p(\Delta)=(1-\Delta)^s$, which has a unique root of multiplicity $s$ and coefficients $p_k=(-1)^k\binom{s}{k}$. Naturally, signals which are close, in the $\ell_2$ distance, to discrete-time polynomials are Sobolev-smooth functions sampled over the regular grid [10]. A sum of harmonic oscillations $x_t=\sum_{k=1}^{s}c_k e^{\imath\omega_k t}$, with $\omega_k\in[0,2\pi)$ all different, is another example; here, $p(\Delta)=\prod_{k=1}^s(1-e^{\imath\omega_k}\Delta)$.
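The polynomial example above is easy to check numerically; a minimal sketch (the degree $s$, the test polynomial, and the slice bounds are arbitrary choices of ours):

```python
import numpy as np
from math import comb

# Check Lemma 2.1 on the example from the text: degree-(s-1) polynomials are
# annihilated by p(Delta) = (1 - Delta)^s, with coefficients p_k = (-1)^k C(s, k).
s = 4
t = np.arange(20)
x = 3.0 - 2.0 * t + 0.5 * t**2 - 0.1 * t**3          # degree s - 1 = 3
p = np.array([(-1) ** k * comb(s, k) for k in range(s + 1)])

# [p(Delta) x]_t = sum_k p_k x_{t-k}; entries are valid (no zero padding) for t >= s.
residual = np.convolve(x, p)[s : len(x)]
assert np.allclose(residual, 0.0)                     # x solves equation (6)
```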
We can now state an oracle inequality for the constrained recovery estimator; the proof can be found in [15, Appendix B].

**Theorem 2.1.** *Let $\varrho\ge 1$, and let $\varphi^o\in\mathbb{C}(\mathbb{Z}_0^n)$ be such that $\|\varphi^o\|^*_{n,1}\le\varrho/\sqrt{n+1}$. Suppose that Assumption A holds for some $s\in\mathbb{Z}_+$ and $\kappa<\infty$. Then for any $\alpha$, $0<\alpha\le 1$, it holds with probability at least $1-\alpha$:*

$$\|x-\widehat x^{\mathrm{con}}\|_{n,2}\ \le\ \|x-\varphi^o*y\|_{n,2}+C\sigma\sqrt{s+\varrho\ln[1/\alpha]+\varrho\,\kappa\sqrt{\ln[n/\alpha]}}.\tag{7}$$

When considering simple signals, Theorem 2.1 gives the following.

**Corollary 2.1.** *Assume that the signal $x$ is $(m,\rho)$-simple with $\rho\ge 1$ and $m\in\mathbb{Z}_+$. Let $n=2m$, $\varrho\ge 2\rho^2$, and let Assumption A hold for some $s\in\mathbb{Z}_+$ and $\kappa<\infty$. Then for any $\alpha$, $0<\alpha\le 1$, it holds with probability at least $1-\alpha$:*

$$\|x-\widehat x^{\mathrm{con}}\|_{n,2}\ \le\ C\sigma\rho^2\sqrt{\ln[1/\alpha]}+C'\sigma\sqrt{s+\rho^2\ln[1/\alpha]+\rho^2\kappa\sqrt{\ln[n/\alpha]}}.$$

**Adaptation and price** The price for adaptation in Theorem 2.1 and Corollary 2.1 is determined by three parameters: the bound $\varrho$ on the filter norm, the deterministic error $\kappa$, and the subspace dimension $s$. Assuming that the signal to recover is simple, and that $\varrho=2\rho^2$, let us compare the magnitude of the oracle error to the term of the risk which reflects the "price of adaptation". Typically (in fact, in all cases of recovery of signals from a shift-invariant subspace known to us), the parameter $\rho$ is at least $\sqrt{s}$. Therefore, the bound (5) implies the typical bound $O(\sigma\sqrt{\gamma}\,\rho^2)\ge\sigma s\sqrt{\gamma}$ for the term $\|x-\varphi^o*y\|_{n,2}$ (we denote $\gamma=\ln(1/\alpha)$). As a result, for instance, in the "parametric situation", when the signal belongs or is very close to the subspace, that is, when $\kappa=O(\ln n)$, the price of adaptation $O\big(\sigma[s+\rho^2(\gamma+\sqrt{\gamma\ln n})]^{1/2}\big)$ is much smaller than the bound on the oracle error. In the "nonparametric situation", when $\kappa=O(\rho^2)$, the price of adaptation has the same order of magnitude as the oracle error.

Finally, note that under the premise of Corollary 2.1 we can also bound the pointwise error. We state the result for $\varrho=2\rho^2$ for simplicity; the proof can be found in [15, Appendix B].

**Theorem 2.2.** *Assume that the signal $x$ is $(m,\rho)$-simple with $\rho\ge 1$ and $m\in\mathbb{Z}_+$. Let $n=2m$, $\varrho=2\rho^2$, and let Assumption A hold for some $s\in\mathbb{Z}_+$ and $\kappa<\infty$. Then for any $\alpha$, $0<\alpha\le 1$, the constrained recovery $\widehat x^{\mathrm{con}}$ satisfies*

$$\big|x_n-[\widehat x^{\mathrm{con}}]_n\big|\ \le\ \frac{C\sigma\rho}{\sqrt{m+1}}\left(\rho\sqrt{\ln[n/\alpha]}+\sqrt{\ln[1/\alpha]}+\sqrt{s}\right).$$

## 3 Oracle inequality for penalized recovery

To use the constrained recovery estimator with a provable guarantee (see, e.g., Theorem 2.1) one must know the norm bound $\varrho$ of a small-error linear filter, or at least an upper bound on it. However, if this parameter is unknown, but instead the noise variance is known (or can be estimated from data), we can build a more practical estimator that still enjoys an oracle inequality. The penalized recovery estimator $[\widehat x^{\mathrm{pen}}]_t=[\widehat\varphi*y]_t$ is an optimal solution to a regularized least-squares minimization problem, where the regularization penalizes the $\ell_1$-norm of the filter in the Fourier domain:

$$\widehat\varphi\in\operatorname*{Argmin}_{\varphi\in\mathbb{C}(\mathbb{Z}_0^n)}\left\{\|y-\varphi*y\|^2_{n,2}+\lambda\sqrt{n+1}\,\|\varphi\|^*_{n,1}\right\}.\tag{8}$$

Similarly to Theorem 2.1, we establish an oracle inequality for the penalized recovery estimator.

**Theorem 3.1.** *Let Assumption A hold for some $s\in\mathbb{Z}_+$ and $\kappa<\infty$, and let $\varphi^o\in\mathbb{C}(\mathbb{Z}_0^n)$ satisfy $\|\varphi^o\|^*_{n,1}\le\varrho/\sqrt{n+1}$ for some $\varrho\ge 1$.*

*1°. Suppose that the regularization parameter of the penalized recovery $\widehat x^{\mathrm{pen}}$ satisfies $\lambda\ge\bar\lambda$, $\bar\lambda:=60\sigma^2\ln[63n/\alpha]$. Then, for $0<\alpha\le 1$, it holds with probability at least $1-\alpha$:*

$$\|x-\widehat x^{\mathrm{pen}}\|_{n,2}\ \le\ \|x-\varphi^o*y\|_{n,2}+C\sqrt{\varrho\lambda}+C\sigma\sqrt{s}+(\widehat\varrho+1)\kappa\sqrt{\lambda},$$

*where $\widehat\varrho:=\sqrt{n+1}\,\|\widehat\varphi\|^*_{n,1}$.*

*2°. Moreover, if $\kappa\le\bar\kappa$, $\bar\kappa:=\frac{\sqrt{\ln[16/\alpha]}}{10\,\ln[42n/\alpha]}$, and $\lambda\ge 2\bar\lambda$, one has*

$$\|x-\widehat x^{\mathrm{pen}}\|_{n,2}\ \le\ \|x-\varphi^o*y\|_{n,2}+C\sqrt{\varrho\lambda}+C'\sigma\sqrt{s}.$$

The proof closely follows that of Theorem 2.1 and can also be found in [15, Appendix B].
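Mirroring the sketch of problem (2) given earlier, problem (8) simply moves the DFT-domain $\ell_1$ term into the objective; a minimal `cvxpy` version (our naming, not the authors' code) could read:

```python
import numpy as np
import scipy.linalg
import cvxpy as cp

def penalized_recovery(y_full, n, lam):
    """Sketch of problem (8): ||y - phi*y||^2_{n,2} + lam * sqrt(n+1) * ||F_n phi||_1.
    `y_full` holds the observations y_{-n}, ..., y_n (length 2n+1)."""
    T = scipy.linalg.toeplitz(y_full[n:], y_full[n::-1])  # T[t, tau] = y_{t-tau}
    F = scipy.linalg.dft(n + 1, scale="sqrtn")            # unitary DFT
    phi = cp.Variable(n + 1, complex=True)
    obj = cp.sum_squares(y_full[n:] - T @ phi) \
        + lam * np.sqrt(n + 1) * cp.norm(F @ phi, 1)
    cp.Problem(cp.Minimize(obj)).solve()
    return phi.value, T @ phi.value
```

Theorem 3.1 suggests the choice $\lambda\approx 60\sigma^2\ln[63n/\alpha]$; the experiments in Sec. 5 use the smaller value $2\sigma^2\ln[63n/\alpha]$, which the authors found to work well in practice.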
## 4 Discussion

There is some redundancy between the simplicity of a signal, as defined by (3), and Assumption A. Usually a simple signal or image $x$ is also close to a low-dimensional subspace of $\mathbb{C}(\mathbb{Z})$ (see, e.g., [10, Section 4]), so that Assumption A holds "automatically". Likewise, $x$ is "almost" simple when it is close to a low-dimensional shift-invariant subspace. Indeed, if $x\in\mathbb{C}(\mathbb{Z})$ belongs to $S$, i.e., Assumption A holds with $\kappa=0$, one can easily verify that for $n\ge s$ there exists a filter $\phi^o\in\mathbb{C}(\mathbb{Z}_{-n}^{n})$ such that

$$\|\phi^o\|_2\ \le\ \sqrt{\frac{s}{n+1}},\qquad\text{and}\qquad x_\tau=[\phi^o*x]_\tau,\quad \tau\in\mathbb{Z};\tag{9}$$

see [15, Appendix C] for the proof. This implies that $x$ can be recovered efficiently from the observations (1):

$$\left(\mathbb{E}\,\big|x_\tau-[\phi^o*y]_\tau\big|^2\right)^{1/2}\ \le\ \sigma\sqrt{\frac{s}{n+1}}.$$

In other words, if instead of the filtering problem we were interested in the *interpolation* problem of recovering $x_t$ given the $2n+1$ observations $y_{t-n},\dots,y_{t+n}$ on the left and on the right of $t$, Assumption A would imply a kind of simplicity of $x$. On the other hand, it is clear that Assumption A is not sufficient to imply the simplicity of $x$ with respect to the *filtering*, in the sense of the definition used in this paper, where we are allowed to use only observations on the left of $t$ to compute the estimate of $x_t$. Indeed, one can see, for instance, that signals from the parametric family $X_\alpha=\{x\in\mathbb{C}(\mathbb{Z}):\ x_\tau=c\alpha^\tau,\ c\in\mathbb{C}\}$ with a given $|\alpha|>1$, which form a one-dimensional space of solutions of the equation $x_\tau=\alpha x_{\tau-1}$, cannot be estimated with small risk at $t$ using only observations on the left of $t$ (unless $c=0$), and thus are not simple in the sense of (3).

Of course, in the above example the difficulty of the family $X_\alpha$ is due to instability of solutions of the difference equation, which explode as $\tau\to+\infty$. Note that signals $x\in X_\alpha$ with $|\alpha|\le 1$ (linear functions, oscillations, or damped oscillations) are simple. More generally, suppose that $x$ satisfies a difference equation of degree $s$:

$$[p(\Delta)x]_\tau=0,\qquad\tau\in\mathbb{Z},\tag{10}$$

where $p(z)=\sum_{i=0}^s p_i z^i$ is the corresponding characteristic polynomial and $\Delta$ is the right-shift operator. When $p(z)$ is unstable, i.e., has roots inside the unit circle, the set of solutions to equation (10) contains (depending on the "initial conditions") signals which are difficult to filter. Observe that stability of solutions is related to the direction of the time axis: when the characteristic polynomial $p(z)$ has roots outside the unit circle, the corresponding solutions may be "left unstable", increasing exponentially as $\tau\to-\infty$. In this case "right filtering", estimating $x_\tau$ using observations on the right of $\tau$, will be difficult.

A special situation where interpolation and filtering are always "simple" arises when the characteristic polynomial of the difference equation has all its roots on the unit circle. In this case, solutions to (10) are generalized harmonic oscillations (harmonic oscillations modulated by polynomials), and such signals are known to be simple. Theorem 4.1 summarizes the properties of the solutions of (10) in this particular case; see [15, Appendix C] for the proof.

**Theorem 4.1.** *Let $s$ be a positive integer, and let $p=[p_0;\dots;p_s]\in\mathbb{C}^{s+1}$ be such that the polynomial $p(z)=\sum_{i=0}^s p_i z^i$ has all its roots on the unit circle. Then for every integer $m$ satisfying $m\ge \overline m(s):=Cs^2\ln(s+1)$, one can point out $q\in\mathbb{C}^{m+1}$ such that any solution to (10) satisfies $x_\tau=[q*x]_\tau$, $\tau\in\mathbb{Z}$, and $\|q\|_2\le\rho(s,m)/\sqrt{m}$, where*

$$\rho(s,m)=C\min\left\{s^{3/2},\ s\sqrt{\ln[ms]}\right\}.\tag{11}$$
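The root-location criteria of this section are easy to apply in practice; below is a small sketch (the function name and tolerance are ours) that classifies the roots of a given characteristic polynomial, illustrated on the family $X_\alpha$:

```python
import numpy as np

def classify_roots(p, tol=1e-8):
    """Given coefficients p_0, ..., p_s of p(z) = sum_i p_i z^i (p_0 = 1),
    locate the roots discussed in Section 4: roots on the unit circle give
    generalized harmonic oscillations ("simple" signals, Theorem 4.1), while
    roots inside the circle admit solutions exploding as tau -> +inf that
    are hard to filter from past observations."""
    roots = np.roots(p[::-1])       # np.roots expects highest degree first
    mods = np.abs(roots)
    return {
        "on_circle": roots[np.isclose(mods, 1.0, atol=tol)],
        "inside": roots[mods < 1 - tol],
        "outside": roots[mods > 1 + tol],
    }

# x_tau = alpha^tau solves x_tau - alpha * x_{tau-1} = 0, i.e. p(z) = 1 - alpha*z,
# whose single root z = 1/alpha lies inside the circle whenever |alpha| > 1.
alpha = 1.5
print(classify_roots(np.array([1.0, -alpha])))   # -> root 0.667 in "inside"
```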
## 5 Numerical experiments

We present preliminary results on simulated data for the proposed adaptive signal recovery methods in several application scenarios. We compare the performance of the penalized $\ell_2$-recovery of Sec. 3 to that of the Lasso recovery of [1] in signal and image denoising problems. Implementation details for the penalized $\ell_2$-recovery are given in Sec. 6. A discussion of the discretization approach underlying the competing Lasso method can be found in [1, Sec. 3.6].

We follow the same methodology in both signal and image denoising experiments. For each level of the signal-to-noise ratio $\mathrm{SNR}\in\{1,2,4,8,16\}$, we perform $N$ Monte-Carlo trials. In each trial, we generate a random signal $x$ on a regular grid with $n$ points, corrupted by i.i.d. Gaussian noise of variance $\sigma^2$. The signal is normalized, $\|x\|_2=1$, so that $\mathrm{SNR}^{-1}=\sigma\sqrt{n}$. We set the regularization penalty in each method as follows. For the penalized $\ell_2$-recovery (8), we use $\lambda=2\sigma^2\ln[63n/\alpha]$ with $\alpha=0.1$. For the Lasso [1], we use the common setting $\lambda=\sigma\sqrt{2\ln n}$. We report experimental results by plotting the $\ell_2$-error $\|\widehat x-x\|_2$, averaged over the $N$ Monte-Carlo trials, versus the inverse signal-to-noise ratio $\mathrm{SNR}^{-1}$.

**Signal denoising** We consider denoising of a one-dimensional signal in two different scenarios, fixing $N=100$ and $n=100$. In the *Random Spikes* scenario, the signal is a sum of 4 harmonic oscillations, each characterized by a spike of a random amplitude at a random position in the continuous frequency domain $[0,2\pi]$. In the *Coherent Spikes* scenario, the same number of spikes is sampled by pairs. Spikes in each pair have the same amplitude and are separated by only $0.1$ of the DFT bin $2\pi/n$, which could make recovery harder due to high signal coherency. However, in practice we found *Random Spikes* to be slightly harder than *Coherent Spikes* for both methods; see Fig. 1.

[Figure 1: Signal and image denoising in different scenarios, left to right: Random Spikes, Coherent Spikes, Random Spikes-2D, and Coherent Spikes-2D. Each panel plots the average $\ell_2$-error of the Lasso [1] and of the penalized $\ell_2$-recovery against $\mathrm{SNR}^{-1}$. The steep parts of the curves at high noise levels correspond to observations being thresholded to zero.]

As Fig. 1 shows, the proposed penalized $\ell_2$-recovery outperforms the Lasso method for all noise levels. The performance gain is particularly significant for high signal-to-noise ratios.

**Image denoising** We now consider recovery of an unknown regression function $f$ on the regular grid on $[0,1]^2$ given the noisy observations

$$y_\tau = x_\tau+\sigma\zeta_\tau,\qquad \tau\in\{0,1,\dots,m-1\}^2,\tag{12}$$

where $x_\tau=f(\tau/m)$. We fix $N=40$ and the grid dimension $m=40$; the number of samples is then $n=m^2$. For the penalized $\ell_2$-recovery, we implement the blockwise denoising strategy (see the Appendix for the implementation details) with just one block for the entire image. We present additional numerical illustrations in the supplementary material.

We study three different scenarios for generating the ground-truth signal in this experiment. The first two scenarios, *Random Spikes-2D* and *Coherent Spikes-2D*, are two-dimensional counterparts of those studied in the signal denoising experiment: the ground-truth signal is a sum of 4 harmonic oscillations in $\mathbb{R}^2$ with random frequencies and amplitudes. The separation in the *Coherent Spikes-2D* scenario is $0.2\pi/m$ in each dimension of the torus $[0,2\pi]^2$. The results for these scenarios are shown in Fig. 1. Again, the proposed penalized $\ell_2$-recovery outperforms the Lasso method for all noise levels, especially for high signal-to-noise ratios.
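For reference, a skeleton of this Monte-Carlo protocol in the Random Spikes scenario (a sketch of ours: the recovery step is a placeholder to be replaced by the penalized recovery of Sec. 3 or the Lasso of [1]):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_spikes_signal(n, k=4):
    """Random Spikes: sum of k harmonic oscillations with random amplitudes and
    random continuous frequencies in [0, 2*pi); normalized so that ||x||_2 = 1."""
    omegas = rng.uniform(0.0, 2 * np.pi, size=k)
    amps = rng.standard_normal(k) + 1j * rng.standard_normal(k)
    t = np.arange(n)
    x = (amps[:, None] * np.exp(1j * omegas[:, None] * t)).sum(axis=0)
    return x / np.linalg.norm(x)

n, N = 100, 100
for snr in [1, 2, 4, 8, 16]:
    sigma = 1.0 / (snr * np.sqrt(n))   # SNR^{-1} = sigma * sqrt(n) since ||x||_2 = 1
    errs = []
    for _ in range(N):
        x = random_spikes_signal(n)
        y = x + sigma * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
        xhat = y                        # placeholder: plug a recovery method in here
        errs.append(np.linalg.norm(xhat - x))
    print(snr, np.mean(errs))
```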
In the scenario *Dimension Reduction-2D* we investigate the problem of estimating a function with a hidden low-dimensional structure. We consider the single-index model of the regression function:

$$f(t)=g(\theta^\top t),\qquad g(\cdot)\in\mathcal{S}^1_\beta(1).\tag{13}$$

Here, $\mathcal{S}^1_\beta(1)=\{g:\mathbb{R}\to\mathbb{R},\ \|g^{(\beta)}(\cdot)\|_2\le 1\}$ is the Sobolev ball of smooth periodic functions on $[0,1]$, and the unknown structure is formalized as the direction $\theta$. In our experiments we sample the direction $\theta$ uniformly at random and consider different values of the smoothness index $\beta$. If it is known a priori that the regression function possesses the structure (13), and only the index is unknown, one can use estimators attaining one-dimensional rates of recovery; see, e.g., [12] and references therein. In contrast, our recovery algorithms are not aware of the underlying structure but might still adapt to it. As shown in Fig. 2, the $\ell_2$-recovery performs well in this scenario despite the fact that the available theoretical bounds are pessimistic. For example, the signal (13) with a smooth $g$ can be approximated by a small number of harmonic oscillations in $\mathbb{R}^2$. As follows from the proof of [9, Proposition 10] combined with Theorem 4.1, for a sum of $k$ harmonic oscillations in $\mathbb{R}^d$ one can point out a reproducing linear filter with $\varrho(k)=O(k^{2d})$ (neglecting logarithmic factors), i.e., the theoretical guarantee is quite conservative for small values of $\beta$.

## 6 Details of algorithm implementation

Here we give a brief account of some techniques and implementation tricks exploited in our codes.

**Solving the optimization problems** Note that the optimization problems (2) and (8) underlying the proposed recovery algorithms are well-structured Second-Order Conic Programs (SOCP) and
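The section is truncated here. To illustrate the SOCP claim on the constrained problem (2): substituting $w=F_n\varphi$ and introducing epigraph variables yields the conic form below (a sketch in our notation, with $A$ denoting the Toeplitz matrix of convolution by $y$ on $[0,n]$; not necessarily the exact reformulation used in the authors' codes):

```latex
% Conic (SOCP) reformulation of problem (2), with w = F_n \varphi:
\begin{aligned}
\min_{w\in\mathbb{C}^{n+1},\ u\in\mathbb{R},\ v\in\mathbb{R}^{n+1}}\quad & u\\
\text{s.t.}\quad & \|y_0^n - A F_n^{\mathsf H} w\|_2 \le u,
    && \text{(second-order cone)}\\
& \sqrt{(\Re w_k)^2+(\Im w_k)^2} \le v_k,\quad 0\le k\le n,
    && \text{(3-dimensional cones)}\\
& \textstyle\sum_{k=0}^{n} v_k \le \varrho/\sqrt{n+1}.
    && \text{(linear constraint)}
\end{aligned}
```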