Regularity Properties for Sparse Regression

Article · March 2016 with 164 Reads

Statistical and machine learning theory has developed several conditions ensuring that popular estimators such as the Lasso or the Dantzig selector perform well in high-dimensional sparse regression, including the restricted eigenvalue, compatibility, and $\ell _q$ sensitivity properties. However, some of the central aspects of these conditions are not well understood. For instance, it is unknown if these conditions can be checked efficiently on any given dataset. This is problematic, because they are at the core of the theory of sparse regression. Here we provide a rigorous proof that these conditions are NP-hard to check. This shows that the conditions are computationally infeasible to verify, and raises some questions about their practical applications. However, by taking an average-case perspective instead of the worst-case view of NP-hardness, we show that a particular condition, $\ell _q$ sensitivity, has certain desirable properties. This condition is weaker and more general than the others. We show that it holds with high probability in models where the parent population is well behaved, and that it is robust to certain data processing steps. These results are desirable, as they provide guidance about when the condition, and more generally the theory of sparse regression, may be relevant in the analysis of high-dimensional correlated observational data.

On the Beta Prime Prior for Scale Parameters in High-Dimensional Bayesian Regression Models

Preprint

Jan 2019

We study high-dimensional Bayesian linear regression with a general beta prime distribution for the scale parameter. Under the assumption of sparsity, we show that appropriate selection of the hyperparameters in the beta prime prior leads to the (near) minimax posterior contraction rate when $p \gg n$. For finite samples, we propose a data-adaptive method for estimating the hyperparameters based on marginal maximum likelihood (MML). This enables our prior to adapt to both sparse and dense settings, and under our proposed empirical Bayes procedure, the MML estimates are never at risk of collapsing to zero. We derive efficient Monte Carlo EM and variational EM algorithms for implementing our model, which are available in the R package NormalBetaPrime. Simulations and analysis of a gene expression data set illustrate our model's self-adaptivity to varying levels of sparsity and signal strengths.

Spatial Homogeneity Pursuit of Regression Coefficients for Large Datasets

Article

Dec 2018
J AM STAT ASSOC

Spatial regression models have been widely used to describe the relationship between a response variable and some explanatory variables over a region of interest, taking into account the spatial dependence of the observations. In many applications, relationships between response variables and covariates are expected to exhibit complex spatial patterns. We propose a new approach, referred to as SCC, to detect spatially clustered patterns in the regression coefficients. It incorporates spatial neighborhood information through a carefully constructed regularization to automatically detect change points in space and to achieve computational scalability. Our numerical studies suggest that SCC works very effectively, capturing not only clustered coefficients, but also smoothly varying coefficients because of its strong local adaptivity. This flexibility allows researchers to explore various spatial structures in regression coefficients. We also establish theoretical properties of SCC. We use SCC to explore the relationship between the temperature and salinity of sea water in the Atlantic basin; this can provide important insights about the evolution of individual water masses and the pathway and strength of meridional overturning circulation in oceanography.

Regression shrinkage and selection via the LASSO

Article

Jan 1996

Rob Tibshirani

Open question: deterministic UUP matrices

T Tao

Features of big data and sparsest solution in high confidence setPast, Present, and Future of Statistical Science

Jan 2014
507-523

J X Fan
C Lin
D L Genest
G Banks

Fan, J.: Features of big data and sparsest solution in high confidence set. In "Past, Present, and Future of Statistical Science" (X, Lin, C. Genest, D. L. Banks, G. Molenberghs, D. W. Scott, J.-L. Wang, Eds.), Chapman & Hall, New York pp. 507-523 (2014)

Statistics for High-Dimensional Data

Jan 2011

P Bühlmann
S Van De Geer

Bühlmann, P., van de Geer, S.: Statistics for High-Dimensional Data, 1st edn. Springer Series in Statistics. Springer (2011)

Optimal solutions for sparse principal component analysis

Jan 2008
J MACH LEARN RES
1269-1294

A Aspremont
F Bach
L E Ghaoui

d'Aspremont, A., Bach, F., Ghaoui, L.E.: Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res. 9, 1269-1294 (2008)

Variable selection via nonconcave penalized likelihood and its oracle properties

Article

Jan 2011
J Am Stat Assoc

Atomic decomposition by basis pursuit[J]

Article

Jan 2001
SIAM REV

Features of Big Data and sparsest solution in high confidence set

Chapter

Mar 2014

Jianqing Fan

Atomic Decomposition by Basis Pursuit

Article

Jan 1998
SIAM J SCI COMPUT

The time-frequency and time-scale communities have recently developed a large number of overcomplete waveform dictionaries-stationary wavelets, wavelet packets, cosine packets, chirplets, and warplets, to name a few. Decomposition into overcomplete systems is not unique, and several methods for decomposition have been proposed, including the method of frames (MOF), Matching pursuit (MP), and, for special dictionaries, the best orthogonal basis (BOB). Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest l(1) norm of coefficients among all such decompositions. We give examples exhibiting several advantages over MOF, MP, and BOB, including better sparsity and superresolution. BP has interesting relations to ideas in areas as diverse as ill-posed problems, in abstract harmonic analysis, total variation denoising, and multiscale edge denoising. BP in highly overcomplete dictionaries leads to large-scale optimization problems. With signals of length 8192 and a wavelet packet dictionary, one gets an equivalent linear program of size 8192 by 212,992. Such problems can be attacked successfully only because of recent advances in linear programming by interior-point methods. We obtain reasonable success with a primal-dual logarithmic barrier method and conjugate-gradient solver.

The Computational Complexity of the Restricted Isometry Property, the Nullspace Property, and Related Concepts in Compressed Sensing

Article

May 2012

Limit Theorems of Probability Theory: Sequences of Independent Random Variables.

Article

Jan 1996
J R STAT SOC A STAT

The deterministic Lasso

Article

Jan 2007

Sara Van De Geer

Limit Theorem of Probability Theory: Sequences of Independent Random Variables

Article

Jan 1975

Valentin V. Petrov

The Computational Complexity of the Restricted Isometry Property, the Nullspace Property, and Related Concepts in Compressed Sensing

Article

Full-text available

Feb 2014
IEEE T INFORM THEORY

This paper deals with the computational complexity of conditions which guarantee that the NP-hard problem of finding the sparsest solution to an underdetermined linear system can be solved by efficient algorithms. In the literature, several such conditions have been introduced. The most well-known ones are the mutual coherence, the restricted isometry property (RIP), and the nullspace property (NSP). While evaluating the mutual coherence of a given matrix is easy, it has been suspected for some time that evaluating RIP and NSP is computationally intractable in general. We confirm these conjectures by showing that for a given matrix ${mbi{A}}$ and positive integer $k$, computing the best constants for which the RIP or NSP hold is, in general, NP-hard. These results are based on the fact that determining the spark of a matrix is NP-hard, which is also established in this paper. Furthermore, we also give several complexity statements about problems related to the above concepts.

Certifying the Restricted Isometry Property is Hard

Article

Full-text available

Jun 2013
IEEE T INFORM THEORY

This paper is concerned with an important matrix condition in compressed sensing known as the restricted isometry property (RIP). We demonstrate that testing whether a matrix satisfies RIP is NP-hard. As a consequence of our result, it is impossible to efficiently test for RIP provided P ≠ NP.

High-dimensional covariance estimation by minimizing -penalized log-determinant divergence

Article

Full-text available

Jan 2011

Given i.i.d. observations of a random vector X∈ℝp, we study the problem of estimating both its covariance matrix Σ*, and its inverse covariance or concentration matrix Θ*=(Σ*)−1. When X is multivariate Gaussian, the non-zero structure of Θ* is specified by the graph of an associated Gaussian Markov random field; and a popular estimator for such sparse Θ* is the ℓ1-regularized Gaussian MLE. This estimator is sensible even for for non-Gaussian X, since it corresponds to minimizing an ℓ1-penalized log-determinant Bregman divergence. We analyze its performance under high-dimensional scaling, in which the number of nodes in the graph p, the number of edges s, and the maximum node degree d, are allowed to grow as a function of the sample size n. In addition to the parameters (p,s,d), our analysis identifies other key quantities that control rates: (a) the ℓ∞-operator norm of the true covariance matrix Σ*; and (b) the ℓ∞ operator norm of the sub-matrix Γ*SS, where S indexes the graph edges, and Γ*=(Θ*)−1⊗(Θ*)−1; and (c) a mutual incoherence or irrepresentability measure on the matrix Γ* and (d) the rate of decay 1/f(n,δ) on the probabilities {|Σ̂nij−Σ*ij|>δ}, where Σ̂n is the sample covariance based on n samples. Our first result establishes consistency of our estimate Θ̂ in the elementwise maximum-norm. This in turn allows us to derive convergence rates in Frobenius and spectral norms, with improvements upon existing results for graphs with maximum node degrees $\ensuremath{d}=o(\sqrt{\ensuremath{s}})$ . In our second result, we show that with probability converging to one, the estimate Θ̂ correctly specifies the zero pattern of the concentration matrix Θ*. We illustrate our theoretical results via simulations for various graphs and problem parameters, showing good correspondences between the theoretical predictions and behavior in simulations.

Statistics for High-Dimensional Data: Method, Theory and Applications

Book

Jan 2011

Testing the Nullspace Property using Semidefinite Programming

Article

Mar 2011
MATH PROGRAM

Recent results in compressed sensing show that, under certain conditions, the sparsest solution to an underdetermined set of linear equations can be recovered by solving a linear program. These results either rely on computing sparse eigenvalues of the design matrix or on properties of its nullspace. So far, no tractable algorithm is known to test these conditions and most current results rely on asymptotic properties of random matrices. Given a matrix A, we use semidefinite relaxation techniques to test the nullspace property on A and show on some numerical examples that these relaxation bounds can prove perfect recovery of sparse solutions with relatively high cardinality. KeywordsCompressed sensing–Nullspace property–Semidefinite programming–Restricted isometry constant

Certifying the restricted isometry property is hard

Article

Apr 2012
IEEE T INFORM THEORY

The p versus NP problem. www.claymath.org/millennium/P vs NP/Official Problem Description

Jan 2000

S Cook

Cook, S.: The p versus NP problem. www.claymath.org/millennium/P vs NP/Official Problem Description.pdf (2000)

Computing performance guarantees for compressed sensing

Conference Paper

Mar 2008
Acoust Speech Signal Process

There are various conditions on the CS matrix for unique and stable recovery. These include universality, or spark, and UUP. Furthermore, quantitative bounds on the stability depend on related properties of the CS matrix. The construction of good CS matrices - satisfying the various properties - is key to successful practical applications of compressive sensing. Unfortunately, verifying the satisfiability of any of these properties for a given CS matrix involves infeasible combinatorial search. Our methods use i and semidefinite relaxation into a convex problem. Given a set of candidate CS matrices, our approach provides tools for the selection of good CS matrices with verified and quantitatively favorable performance.

Computational Complexity: A Modern Approach

Book

Jan 2009

Cambridge Core - Algorithmics, Complexity, Computer Algebra, Computational Geometry - Computational Complexity - by Sanjeev Arora

Restricted Eigenvalue Properties for Correlated Gaussian Designs.

Article

Aug 2010
J MACH LEARN RES

Methods based on`1-relaxation, such as basis pursuit and the Lasso, are very po pular for sparse re- gression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if an d only if the design matrix X satisfies the restricted nullspace property, and (2) the squared `2-error of a Lasso estimate decays at the minimax optimal rate k log p n , where k is the sparsity of the p-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted e igenvalue condition. The key issue is thus to determine when the design matrix X satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfi ed when all entries of X are independent and identically distributed (i.i.d.), or the rows are unita ry. This paper proves directly that the re- stricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theo- retical guarantees on `1-relaxations to a much broader class of problems than the case of completely independent or unitary designs.

Rate Minimaxity of the Lasso and Dantzig Selector for the l q Loss in l r Balls

Article

Full-text available

Mar 2010
J MACH LEARN RES

We consider the estimation of regression coefficients in a high-dimensional linear model. For regression coefficients in lr balls, we provide lower bounds for the minimax lq risk and minimax quantiles of the lq loss for all design matrices. Under an l0 sparsity condition on a target coefficient vector, we sharpen and unify existing oracle inequalities for the Lasso and Dantzig selector. We derive oracle inequalities for target coefficient vectors with many small elements and smaller threshold levels than the universal threshold. These oracle inequalities provide sufficient conditions on the design matrix for the rate minimaxity of the Lasso and Dantzig selector for the lq risk and loss in lr balls, 0≤ r≤ 1≤ q≤ ∞. By allowing q=∞, our risk bounds imply the variable selection consistency of threshold Lasso and Dantzig selectors.

High-dimensional instrumental variables regression and confidence sets

Article

May 2011

We propose an instrumental variables method for estimation in linear models with endogenous regressors in the high-dimensional setting where the sample size n can be smaller than the number of possible regressors K, and L>=K instruments. We allow for heteroscedasticity and we do not need a prior knowledge of variances of the errors. We suggest a new procedure called the STIV (Self Tuning Instrumental Variables) estimator, which is realized as a solution of a conic optimization program. The main results of the paper are upper bounds on the estimation error of the vector of coefficients in l_p-norms for 1<= p<=\infty that hold with probability close to 1, as well as the corresponding confidence intervals. All results are non-asymptotic. These bounds are meaningful under the assumption that the true structural model is sparse, i.e., the vector of coefficients has few non-zero coordinates (less than the sample size n) or many coefficients are too small to matter. In our IV regression setting, the standard tools from the literature on sparsity, such as ther restricted eigenvalue assumption are inapplicable. Therefore, for our analysis we develop a new approach based on data-driven sensitivity characteristics. We show that, under appropriate assumptions, a thresholded STIV estimator correctly selects the non-zero coefficients with probability close to 1. The price to pay for not knowing which coefficients are non-zero and which instruments to use is of the order \sqrt{\log(L)} in the rate of convergence. We extend the procedure to deal with high-dimensional problems where some instruments can be non-valid. We obtain confidence intervals for non-validity indicators and we suggest a procedure, which correctly detects the non-valid instruments with probability close to 1.

Reconstruction From Anisotropic Random Measurements

Article

Jun 2011
IEEE T INFORM THEORY

Random matrices are widely used in sparse recovery problems, and the relevant properties of matrices with i.i.d. entries are well understood. The current paper discusses the recently introduced Restricted Eigenvalue (RE) condition, which is among the most general assumptions on the matrix, guaranteeing recovery. We prove a reduction principle showing that the RE condition can be guaranteed by checking the restricted isometry on a certain family of low-dimensional subspaces. This principle allows us to establish the RE condition for several broad classes of random matrices with dependent entries, including random matrices with subgaussian rows and non-trivial covariance structure, as well as matrices with independent rows, and uniformly bounded entries.

Introduction to the non-asymptotic analysis of random matrices

Article

Nov 2010

Roman Vershynin

This is a tutorial on some basic non-asymptotic methods and concepts in random matrix theory. The reader will learn several tools for the analysis of the extreme singular values of random matrices with independent rows or columns. Many of these methods sprung off from the development of geometric functional analysis since the 1970's. They have applications in several fields, most notably in theoretical computer science, statistics and signal processing. A few basic applications are covered in this text, particularly for the problem of estimating covariance matrices in statistics and for validating probabilistic constructions of measurement matrices in compressed sensing. These notes are written particularly for graduate students and beginning researchers in different areas, including functional analysts, probabilists, theoretical statisticians, electrical engineers, and theoretical computer scientists.

On the Conditions used to Prove Oracle Results for the Lasso

Article

Full-text available

Oct 2009

Oracle inequalities and variable selection properties for the Lasso in linear models have been established under a variety of different assumptions on the design matrix. We show in this paper how the different conditions and concepts relate to each other. The restricted eigenvalue condition (Bickel et al., 2009) or the slightly weaker compatibility condition (van de Geer, 2007) are sufficient for oracle results. We argue that both these conditions allow for a fairly general class of design matrices. Hence, optimality of the Lasso for prediction and estimation holds for more general situations than what it appears from coherence (Bunea et al, 2007b,c) or restricted isometry (Candes and Tao, 2005) assumptions. Comment: 33 pages, 1 figure

The Dantzig selector and sparsity oracle inequalities

Article

Sep 2009
BERNOULLI

Vladimir Koltchinskii

Let \[Y_j=f_*(X_j)+\xi_j,\qquad j=1,...,n,\] where $X,X_1,...,X_n$ are i.i.d. random variables in a measurable space $(S,\mathcal{A})$ with distribution $\Pi$ and $\xi,\xi_1,... ,\xi_n$ are i.i.d. random variables with ${\mathbb{E}}\xi=0$ independent of $(X_1,...,X_n).$ Given a dictionary $h_1,...,h_N:S\mapsto{\mathbb{R}},$ let $f_{\lambda}:=\sum_{j=1}^N\lambda_jh_j$, $\lambda=(\lambda_1,...,\lambda_N)\in{\mathbb{R}}^N.$ Given $\varepsilon>0,$ define \[\hat{\Lambda}_{\varepsilon}:=\Biggl\{\lam bda\in{\mathbb{R}}^N:\max_{1\leq k\leq N}\Biggl|n^{-1}\sum_{j=1}^n\big l(f_{\lambda}(X_j)-Y_j\bigr)h_k(X_j)\Biggr|\leq\varepsilon \Biggr\}\] and \[\hat{\lambda}:=\hat{\lambda}^{\varepsilon}\in \operatorname {Arg min}\limits_{\lambda\in\hat{\Lambda}_{\varepsilon}}\|\lambda\|_{\ell_1}.\] In the case where $f_*:=f_{\lambda^*},\lambda^*\in {\mathbb{R}}^N,$ Candes and Tao [Ann. Statist. 35 (2007) 2313--2351] suggested using $\hat{\lambda}$ as an estimator of $\lambda^*.$ They called this estimator ``the Dantzig selector''. We study the properties of $f_{\hat{\lambda}}$ as an estimator of $f_*$ for regression models with random design, extending some of the results of Candes and Tao (and providing alternative proofs of these results). Comment: Published in at http://dx.doi.org/10.3150/09-BEJ187 the Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)

Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties

Article

Feb 2001
J Am Stat Assoc

Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of parametric models such as generalized linear models and robust regression models. They can also be applied easily to nonparametric modeling by using wavelets and splines. Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known. Our simulation shows that the newly proposed methods compare favorably with other variable selection techniques. Furthermore, the standard error formulas are tested to be accurate enough for practical applications.

Compressed Sensing and Redundant Dictionaries

Article

Full-text available

Jun 2008
IEEE T INFORM THEORY

This paper extends the concept of compressed sensing to signals that are not sparse in an orthonormal basis but rather in a redundant dictionary. It is shown that a matrix, which is a composition of a random matrix of certain type and a deterministic dictionary, has small restricted isometry constants. Thus, signals that are sparse with respect to the dictionary can be recovered via basis pursuit (BP) from a small number of random measurements. Further, thresholding is investigated as recovery algorithm for compressed sensing, and conditions are provided that guarantee reconstruction with high probability. The different schemes are compared by numerical experiments.

Decoding by Linear Programming

Article

Jan 2006
IEEE T INFORM THEORY

This paper considers a natural error correcting problem with real valued input/output. We wish to recover an input vector f∈Rn from corrupted measurements y=Af+e. Here, A is an m by n (coding) matrix and e is an arbitrary and unknown vector of errors. Is it possible to recover f exactly from the data y? We prove that under suitable conditions on the coding matrix A, the input f is the unique solution to the ℓ1-minimization problem (||x||ℓ1:=Σi|xi|) min(g∈Rn) ||y - Ag||ℓ1 provided that the support of the vector of errors is not too large, ||e||ℓ0:=|{i:ei ≠ 0}|≤ρ·m for some ρ>0. In short, f can be recovered exactly by solving a simple convex optimization problem (which one can recast as a linear program). In addition, numerical experiments suggest that this recovery procedure works unreasonably well; f is recovered exactly even in situations where a significant fraction of the output is corrupted. This work is related to the problem of finding sparse solutions to vastly underdetermined systems of linear equations. There are also significant connections with the problem of recovering signals from highly incomplete measurements. In fact, the results introduced in this paper improve on our earlier work. Finally, underlying the success of ℓ1 is a crucial property we call the uniform uncertainty principle that we shall describe in detail.

Uncertainty Principles and Ideal Atomic Decomposition

Article

Dec 2001
IEEE T INFORM THEORY

Suppose a discrete-time signal S(t), 0&les;t<N, is a superposition of atoms taken from a combined time-frequency dictionary made of spike sequences 1{t=τ} and sinusoids exp{2πiwt/N}/√N. Can one recover, from knowledge of S alone, the precise collection of atoms going to make up S? Because every discrete-time signal can be represented as a superposition of spikes alone, or as a superposition of sinusoids alone, there is no unique way of writing S as a sum of spikes and sinusoids in general. We prove that if S is representable as a highly sparse superposition of atoms from this time-frequency dictionary, then there is only one such highly sparse representation of S, and it can be obtained by solving the convex optimization problem of minimizing the l1 norm of the coefficients among all decompositions. Here “highly sparse” means that Nt+Nw<√N/2 where Nt is the number of time atoms, Nw is the number of frequency atoms, and N is the length of the discrete-time signal. Underlying this result is a general l1 uncertainty principle which says that if two bases are mutually incoherent, no nonzero signal can have a sparse representation in both bases simultaneously. For the above setting, the bases are sinusoids and spikes, and mutual incoherence is measured in terms of the largest inner product between different basis elements. The uncertainty principle holds for a variety of interesting basis pairs, not just sinusoids and spikes. The results have idealized applications to band-limited approximation with gross errors, to error-correcting encryption, and to separation of uncoordinated sources. Related phenomena hold for functions of a real variable, with basis pairs such as sinusoids and wavelets, and for functions of two variables, with basis pairs such as wavelets and ridgelets. In these settings, if a function f is representable by a sufficiently sparse superposition of terms taken from both bases, then there is only one such sparse representation; it may be obtained by minimum l1 norm atomic decomposition. The condition “sufficiently sparse” becomes a multiscale condition; for example, that the number of wavelets at level j plus the number of sinusoids in the jth dyadic frequency band are together less than a constant times 2j/2

Atomic Decomposition by Basis Pursuit

Article

Jan 1998
SIAM REV

The Time-Frequency and Time-Scale communities have recently developed a large number of overcomplete waveform dictionaries --- stationary wavelets, wavelet packets, cosine packets, chirplets, and warplets, to name a few. Decomposition into overcomplete systems is not unique, and several methods for decomposition have been proposed, including the Method of Frames (MOF), Matching Pursuit (MP), and, for special dictionaries, the Best Orthogonal Basis (BOB). Basis Pursuit (BP) is a principle for decomposing a signal into an "optimal" superposition of dictionary elements, where optimal means having the smallest l 1 norm of coefficients among all such decompositions. We give examples exhibiting several advantages over MOF, MP! and BOB, including better sparsity, and super-resolution. BP has interesting relations to ideas in areas as diverse as ill-posed problems, in abstract harmonic analysis, total variation de-noising, and multi-scale edge de-noising. Basis Pursuit in highly ...

The Dantzig Selector: Statistical Estimation when p is Much Larger than n

Article

Jun 2005
ANN STAT

In many important statistical applications, the number of variables or parameters p is much larger than the number of observations n. Suppose then that we have observations y = X beta + z, where beta epsilon R-p is a parameter vector of interest, X is a data matrix with possibly far fewer rows than columns, n << p, and the z(i)'s are i.i.d. N(0, sigma(2)). Is it possible to estimate beta reliably based on the noisy data y? To estimate beta, we introduce a new estimator-we call it the Dantzig selector-which is a solution to the l(1)-regularization problem (min) ((beta) over tilde epsilon Rp) parallel to(beta) over tilde parallel to l(1) subject to parallel to X*r parallel to l(infinity) <= (1 + t(-1)) root 2 log p.sigma, where r is the residual vector y - X (beta) over tilde and t is a positive scalar. We show that if X obeys a uniform uncertainty principle (with unit-normed columns) and if the true parameter vector beta is sufficiently sparse (which here roughly guarantees that the model is identifiable), then with very large probability, parallel to(beta) over cap-beta parallel to(2)(l2) <= C-2 . 2 log p . (sigma(2) + Sigma(i) min(beta(2)(i), sigma(2))). Our results are nonasymptotic and we give values for the constant C. Even though n may be much smaller than p, our estimator achieves a loss within a logarithmic factor of the ideal mean squared error one would achieve with an oracle which would supply perfect information about which coordinates are nonzero, and which were above the noise level. In multivariate regression and from a model selection viewpoint, our result says that it is possible nearly to select the best subset of variables by solving a very simple convex program, which, in fact, can easily be recast as a convenient linear program (LP).

Simultaneous analysis of LASSO and Dantzig selector

Article

Full-text available

Feb 2008
ANN STAT

We exhibit an approximate equivalence between the Lasso estimator and Dantzig selector. For both methods we derive parallel oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the $\ell_p$ estimation loss for $1\le p\le 2$ in the linear model when the number of variables can be much larger than the sample size.

Optimal Solutions for Sparse Principal Component Analysis

Article

Jul 2007
J MACH LEARN RES

Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n^3), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n^3) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases.

Sparsity Oracle Inequalities for the Lasso

Article

Jun 2007

This paper studies oracle properties of $\ell_1$-penalized least squares in nonparametric regression setting with random design. We show that the penalized least squares estimator satisfies sparsity oracle inequalities, i.e., bounds in terms of the number of non-zero components of the oracle vector. The results are valid even when the dimension of the model is (much) larger than the sample size and the regression matrix is not positive definite. They can be applied to high-dimensional linear regression, to nonparametric adaptive regression estimation and to the problem of aggregation of arbitrary estimators.

Regularity Properties for Sparse Regression

Recommended publications

A novel k NN Algorithm with Data-driven k Parameter Computation

A robust inverse regression estimator

Testing Shape Constraints in Lasso Regularized Joinpoint Regression

Extensive semi-quantitative regression

Continuity and differentiability of regression M functionals

Dimension reduction in regression without matrix inversion

Who Mediates? : The Political Economy of International Conflict Management

On Directional Regression for Dimension Reduction

Invariance Principles for Recursive Residuals

CWoLa Hunting: Extending the Bump Hunt with Machine Learning

Dimension reduction using the generalized gradient direction

Restrictions on candidate predictors