Endogeneity in Ultrahigh Dimension

Article in SSRN Electronic Journal · April 2012
DOI: 10.2139/ssrn.2045864 · Source: arXiv
Abstract
Most papers on high-dimensional statistics are based on the assumption that none of the regressors are correlated with the regression error, namely, that they are exogenous. Yet endogeneity arises easily in high-dimensional regression because of the large pool of regressors, and it causes inconsistency of penalized least-squares methods and possibly false scientific discoveries. We give a necessary condition for model selection consistency of a very general class of penalized regression methods, which allows us to prove the inconsistency claim formally. To cope with possible endogeneity, we construct a novel penalized focused generalized method of moments (FGMM) criterion function and offer a new optimization algorithm. The FGMM criterion is not a smooth function. To establish its asymptotic properties, we first study model selection consistency and an oracle property for a general class of penalized regression methods. These results are then used to show that the FGMM possesses an oracle property even in the presence of endogenous predictors, and that the solution is also a near-global minimum under an over-identification assumption. Finally, we show how semiparametric efficiency of estimation can be achieved via a two-step approach.
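The following is a minimal simulation sketch, not code from the paper, illustrating the phenomenon the abstract describes: when a regressor is correlated with the regression error, penalized least squares (here the Lasso) tends to select it spuriously. The data-generating process, penalty level, and variable names are illustrative assumptions.

```python
# Illustrative only: endogeneity breaking Lasso model selection.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)
X[:, 10] = 0.8 * eps + 0.6 * rng.standard_normal(n)   # make regressor 10 endogenous
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                           # true sparse, exogenous signal
y = X @ beta + eps

fit = Lasso(alpha=0.1).fit(X, y)
print("selected:", np.flatnonzero(fit.coef_))         # typically includes index 10 spuriously
```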

  • Article
    Full-text available
    Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This paper gives an overview of the salient features of Big Data and how these features drive paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and consequently wrong scientific conclusions.
  • Article
    Full-text available
    Structural models of demand founded on the classic work of Berry, Levinsohn, and Pakes (1995) link variation in aggregate market shares for a product to the influence of product attributes on heterogeneous consumer tastes. We consider implementing these models in settings with complicated products where consumer preferences for product attributes are sparse, that is, where only a small proportion of a high-dimensional set of product characteristics influences consumer tastes. We propose a multistep estimator to efficiently perform uniform inference. Our estimator employs a penalized pre-estimation model specification stage to consistently estimate nonlinear features of the BLP model. We then perform selection via a Triple-LASSO for explanatory controls, treatment selection controls, and instrument selection. After selecting variables, we use an unpenalized GMM estimator for inference. Monte Carlo simulations verify the performance of these estimators.
  • Article
    Full-text available
    High dimensionality is a problem in many research areas, and a huge number of dimensionality reduction methods are available. Broadly, they are grouped into two categories: feature selection and feature extraction. Feature selection methods select a subset of features based on some criteria, while feature extraction methods transform the data into a lower-dimensional space. This paper presents a survey of classical and modern dimensionality reduction methods. Genetic Algorithm, Particle Swarm Optimization, Ant Colony Optimization, Artificial Neural Network and Artificial Immune System are a few modern nature-inspired methods that have been applied to the feature selection problem. To find the best feature selection methods, an experiment has been conducted using classical feature extraction, classical feature selection, and the nature-inspired genetic algorithm and particle swarm optimization. Experimental results reveal that the modern nature-inspired particle swarm optimization outperforms the other methods.
  • Conference Paper
    Full-text available
    In the Big Data community, MapReduce has been seen as one of the key enabling approaches for meeting continuously increasing demands on computing resources imposed by massive data sets. The reason for this is the high scalability of the MapReduce paradigm, which allows for massively parallel and distributed execution over a large number of computing nodes. This paper identifies MapReduce issues and challenges in handling Big Data with the objective of providing an overview of the field, facilitating better planning and management of Big Data projects, and identifying opportunities for future research in this field. The identified challenges are grouped into four main categories corresponding to Big Data task types: data storage (relational databases and NoSQL stores), Big Data analytics (machine learning and interactive analytics), online processing, and security and privacy. Moreover, current efforts aimed at improving and extending MapReduce to address identified challenges are presented. Consequently, by identifying issues and challenges MapReduce faces when handling Big Data, this study encourages future Big Data research.
  • Article
    In a living organism, tens of thousands of genes are expressed and interact with each other to achieve necessary cellular functions. Gene regulatory networks contain information on regulatory mechanisms and the functions of gene expressions. Thus, incorporating network structures, discerned either through biological experiments or statistical estimations, could potentially increase the selection and estimation accuracy of genes associated with a phenotype of interest. Here, we considered a gene selection problem using gene expression data and the graphical structures found in gene networks. Because gene expression measurements are intermediate phenotypes between a trait and its associated genes, we adopted an instrumental variable regression approach. We treated genetic variants as instrumental variables to address the endogeneity issue. We proposed a two-step estimation procedure. In the first step, we applied the LASSO algorithm to estimate the effects of genetic variants on gene expression measurements. In the second step, the projected expression measurements obtained from the first step were treated as input variables. A graph-constrained regularization method was adopted to improve the efficiency of gene selection and estimation. We theoretically showed the selection consistency of the estimation method and derived the L∞ bound of the estimates. Simulation and real data analyses were conducted to demonstrate the effectiveness of our method and to compare it with its counterparts.
  • Article
    We study the variable selection problem for a class of generalized linear models with endogenous covariates. Based on the instrumental variable adjustment technology and the smooth-threshold estimating equation (SEE) method, we propose an instrumental variable based variable selection procedure. The proposed variable selection method can attenuate the effect of endogeneity in covariates, and is easy for application in practice. Some theoretical results are also derived such as the consistency of the proposed variable selection procedure and the convergence rate of the resulting estimator. Further, some simulation studies and a real data analysis are conducted to evaluate the performance of the proposed method, and simulation results show that the proposed method is workable.
  • Chapter
    Technological developments have reshaped scientific thinking, since observations from experiments and the real world are massive. Each experiment is able to produce information about a huge number of variables (high dimensionality). The unique characteristics of high dimensionality pose various challenges to traditional learning methods. This paper presents the problems produced by high dimensionality and proposes a new fuzzy versatile binary PSO (FVBPSO) method. Experimental results show the curse of dimensionality and the merits of the proposed method on benchmark datasets.
  • Article
    Full-text available
    Recent technological developments in microarray experiments have provided many new possibilities for simultaneous measurement. But new challenges have arisen from these massive quantities of information, qualified as Big Data. The challenge is to extract the main, meaningful information from the data. To this end researchers are using various techniques such as hierarchical clustering, mutual information and self-organizing maps, to name a few. However, the management and analysis of the resulting massive datasets have not yet reached a satisfactory level, and there is no clear consensus about the best method or methods for revealing patterns of gene expression. Thus, many efforts are required to strengthen the methodologies for optimal analysis of Big Data. In this paper, we propose a new processing approach structured around feature extraction and selection. The feature extraction is based on correlation and rank analysis and leads to a reduction in the number of variables. The feature selection consists of eliminating redundant or irrelevant variables, using some adapted techniques of discriminant analysis. Our approach is tested on three types of cancer gene expression microarrays and compared with other competing approaches. It performs well in terms of prediction results, computation and processing time.
  • Article
    This chapter reviews the literature on variable selection in nonparametric and semiparametric regression models via shrinkage. We highlight recent developments on simultaneous variable selection and estimation through the methods of least absolute shrinkage and selection operator (Lasso), smoothly clipped absolute deviation (SCAD), or their variants, but restrict our attention to nonparametric and semiparametric regression models. In particular, we consider variable selection in additive models, partially linear models, functional/varying coefficient models, single index models, general nonparametric regression models, and semiparametric/nonparametric quantile regression models.
  • Article
    Full-text available
    Recent years have seen an increase in the amount of statistics describing different phenomena based on “Big Data.” This term includes data characterized not only by their large volume, but also by their variety and velocity, the organic way in which they are created, and the new types of processes needed to analyze them and make inference from them. The change in the nature of the new types of data, their availability, and the way in which they are collected and disseminated is fundamental. This change constitutes a paradigm shift for survey research. There is great potential in Big Data, but there are some fundamental challenges that have to be resolved before its full potential can be realized. This report provides examples of different types of Big Data and their potential for survey research; it also describes the Big Data process, discusses its main challenges, and considers solutions and research needs.
  • Article
    In genetical genomics studies, it is important to jointly analyze gene expression data and genetic variants in exploring their associations with complex traits, where the dimensionality of gene expressions and genetic variants can both be much larger than the sample size. Motivated by such modern applications, we consider the problem of variable selection and estimation in high-dimensional sparse instrumental variables models. To overcome the difficulty of high dimensionality and unknown optimal instruments, we propose a two-stage regularization framework for identifying and estimating important covariate effects while selecting and estimating optimal instruments. The methodology extends the classical two-stage least squares estimator to high dimensions by exploiting sparsity using sparsity-inducing penalty functions in both stages. The resulting procedure is efficiently implemented by coordinate descent optimization. In the representative case of $L_1$ regularization, we establish estimation, prediction, and model selection properties of the two-stage regularized estimators in the high-dimensional setting where the dimensions of covariates and instruments are both allowed to grow exponentially with the sample size. The practical performance of the proposed method is evaluated by simulation studies and its usefulness is illustrated by an analysis of mouse obesity data.
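A schematic sketch of the two-stage regularized instrumental variables idea described in the preceding entry, with a Lasso fit in each stage; this is an illustrative approximation (penalty levels, solver, and function name are assumptions), not the authors' implementation.

```python
# Sketch: Lasso first stage (instruments -> covariates), Lasso second stage.
import numpy as np
from sklearn.linear_model import Lasso

def two_stage_lasso(y, X, Z, alpha1=0.1, alpha2=0.1):
    """X: n x p endogenous covariates, Z: n x q instruments (illustrative)."""
    X_hat = np.column_stack([
        Lasso(alpha=alpha1).fit(Z, X[:, j]).predict(Z)   # stage 1: project X_j on Z
        for j in range(X.shape[1])
    ])
    return Lasso(alpha=alpha2).fit(X_hat, y).coef_       # stage 2: regress y on projections
```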
  • Article
    Full-text available
    We propose MC+, a fast, continuous, nearly unbiased and accurate method of penalized variable selection in high-dimensional linear regression. The LASSO is fast and continuous, but biased. The bias of the LASSO may prevent consistent variable selection. Subset selection is unbiased but computationally costly. The MC+ has two elements: a minimax concave penalty (MCP) and a penalized linear unbiased selection (PLUS) algorithm. The MCP provides the convexity of the penalized loss in sparse regions to the greatest extent given certain thresholds for variable selection and unbiasedness. The PLUS computes multiple exact local minimizers of a possibly nonconvex penalized loss function in a certain main branch of the graph of critical points of the penalized loss. Its output is a continuous piecewise linear path extending from the origin for infinite penalty to a least squares solution for zero penalty. We prove that at a universal penalty level, the MC+ has high probability of matching the signs of the unknowns, and thus correct selection, without assuming the strong irrepresentable condition required by the LASSO. This selection consistency applies to the case of $p\gg n$, and is proved to hold for exactly the MC+ solution among possibly many local minimizers. We prove that the MC+ attains certain minimax convergence rates in probability for the estimation of regression coefficients in $\ell_r$ balls. We use the SURE method to derive degrees of freedom and $C_p$-type risk estimates for general penalized LSE, including the LASSO and MC+ estimators, and prove their unbiasedness. Based on the estimated degrees of freedom, we propose an estimator of the noise level for proper choice of the penalty level. Published in the Annals of Statistics (DOI: 10.1214/09-AOS729).
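For reference, the minimax concave penalty (MCP) mentioned above has a simple closed form; the sketch below writes it out (the concavity parameter is conventionally called gamma) and is purely illustrative, not the PLUS algorithm itself.

```python
# MCP in closed form: lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam, else gamma*lam^2/2.
import numpy as np

def mcp_penalty(t, lam, gamma=3.0):
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a**2 / (2.0 * gamma),
                    0.5 * gamma * lam**2)
```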
  • Hansen, B. (2010). Econometrics. Unpublished manuscript, University of Wisconsin.
  • Engle, R., Hendry, D. and Richard, J. (1983). Exogeneity. Econometrica, 51, 277-304.
  • Article
    Bridge regression, a special family of penalized regressions with penalty function Σ|βj|^γ for γ ≤ 1, is considered. A general approach to solving for the bridge estimator is developed. A new algorithm for the lasso (γ = 1) is obtained by studying the structure of the bridge estimators. The shrinkage parameter γ and the tuning parameter λ are selected via generalized cross-validation (GCV). A comparison between the bridge model (γ ≤ 1) and several other shrinkage models, namely ordinary least squares regression (λ = 0), the lasso (γ = 1) and ridge regression (γ = 2), is made through a simulation study. It is shown that bridge regression performs well compared to the lasso and ridge regression. These methods are demonstrated through an analysis of prostate cancer data. Some computational advantages and limitations are discussed.
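A sketch of the bridge objective described above, written directly from the penalty Σ|βj|^γ with γ ≤ 1; the scaling of the least-squares term is an illustrative choice, and no particular solver is implied.

```python
# Bridge regression objective: 0.5 * ||y - X beta||^2 + lam * sum(|beta_j|^gamma).
import numpy as np

def bridge_objective(beta, X, y, lam, gamma=0.5):
    resid = y - X @ beta
    return 0.5 * np.sum(resid**2) + lam * np.sum(np.abs(beta) ** gamma)
```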
  • Article
    This paper proposes a generalized method of moments (GMM) shrinkage method to efficiently estimate the unknown parameters θ₀ identified by some moment restrictions, when there is another set of possibly misspecified moment conditions. We show that our method enjoys oracle-like properties; i.e., it consistently selects the correct moment conditions in the second set and at the same time, its estimator is as efficient as the GMM estimator based on all correct moment conditions. For empirical implementation, we provide a simple data-driven procedure for selecting the tuning parameters of the penalty function. We also establish oracle properties of the GMM shrinkage method in the practically important scenario where the moment conditions in the first set fail to strongly identify θ₀. The simulation results show that the method works well in terms of correct moment selection and the finite sample properties of its estimators. As an empirical illustration, we apply our method to estimate the life-cycle labor supply equation studied in MaCurdy (1981, Journal of Political Economy 89(6), 1059–1085) and Altonji (1986, Journal of Political Economy 94(3), 176–215). Our empirical findings support the validity of the instrumental variables used in both papers and confirm that wage is an endogenous variable in the labor supply equation.
  • Article
    This chapter aims to discuss (asymptotically) efficient estimation of the parameters of conditional moment restriction models. A useful type of model that imposes few restrictions and can allow for simultaneity is a conditional moment restriction model, where all that is specified is that a vector of residuals, consisting of known, prespecified functions of the data and parameters, has conditional mean zero given known variables. Estimators for the parameters of these models can be constructed by interacting functions of the residuals with functions of the conditioning variables and choosing the parameter estimates so that the sample moments of these interactions are zero. These estimators are conditional, implicit versions of the method of moments that are typically referred to as instrumental variables (IV) estimators, where the instruments are the functions of conditioning variables that interacted with the residuals. These estimators have the usual advantage of method of moments over maximum likelihood that their consistency only depends on correct specification of the residuals and conditioning variables, and not on the correctness of a likelihood function. Maximum likelihood may be more efficient than IV if the distribution is correctly specified, so that the usual bias/efficiency tradeoff is present for IV and maximum likelihood. The chapter discusses the description and motivation for IV estimators, several approaches to efficient estimation, the nearest neighbor nonparametric estimation of the optimal instruments, and estimation via linear combinations of functions.
  • Article
    This paper is concerned with an orthogonal wavelet series estimator of an unknown smooth regression function observed with noise on a bounded interval. A penalized least-squares approach is adopted and our method uses the specific asymptotic interpolating properties of the wavelet approximation generated by a particular wavelet basis, Daubechies' coiflets. A simple procedure is described to estimate the smoothing parameter of the penalizing functional and conditions are given for the estimator to attain optimal convergence rates in the integrated mean square sense as the sample size increases to infinity. The results are illustrated with simulated and real examples and a comparison with other non-parametric smoothers is made.
  • Article
    Full-text available
    We consider a finite mixture of regressions (FMR) model for high-dimensional inhomogeneous data where the number of covariates may be much larger than sample size. We propose an ℓ 1-penalized maximum likelihood estimator in an appropriate parameterization. This kind of estimation belongs to a class of problems where optimization and theory for non-convex functions is needed. This distinguishes itself very clearly from high-dimensional estimation with convex loss- or objective functions as, for example, with the Lasso in linear or generalized linear models. Mixture models represent a prime and important example where non-convexity arises. For FMR models, we develop an efficient EM algorithm for numerical optimization with provable convergence properties. Our penalized estimator is numerically better posed (e.g., boundedness of the criterion function) than unpenalized maximum likelihood estimation, and it allows for effective statistical regularization including variable selection. We also present some asymptotic theory and oracle inequalities: due to non-convexity of the negative log-likelihood function, different mathematical arguments are needed than for problems with convex losses. Finally, we apply the new method to both simulated and real data.
  • Article
    In the paper I give a brief review of the basic idea and some history and then discuss some developments since the original paper on regression shrinkage and selection via the lasso.
  • Article
    Full-text available
    Consider the high-dimensional linear regression model $y = X\beta^* + w$, where $y \in \mathbb{R}^n$ is an observation vector, $X \in \mathbb{R}^{n \times d}$ is a design matrix with $d > n$, $\beta^* \in \mathbb{R}^d$ is an unknown regression vector, and $w \sim N(0, \sigma^2 I)$ is additive Gaussian noise. This paper studies the minimax rates of convergence for estimating $\beta^*$ in either $\ell_2$-loss or $\ell_2$-prediction loss, assuming that $\beta^*$ belongs to an $\ell_q$-ball $B_q(R_q)$ for some $q \in [0,1]$. It is shown that under suitable regularity conditions on the design matrix $X$, the minimax optimal rate in $\ell_2$-loss and $\ell_2$-prediction loss scales as $\Theta(R_q (\log d / n)^{1 - q/2})$. The analysis in this paper reveals that conditions on the design matrix $X$ enter into the rates for $\ell_2$-error and $\ell_2$-prediction error in complementary ways in the upper and lower bounds. Our proofs of the lower bounds are information-theoretic in nature, based on Fano's inequality and results on the metric entropy of the balls $B_q(R_q)$, whereas our proofs of the upper bounds are constructive, involving direct analysis of least squares over $\ell_q$-balls. For the special case $q = 0$, corresponding to models with an exact sparsity constraint, our results show that although computationally efficient $\ell_1$-based methods can achieve the minimax rates up to constant factors, they require slightly stronger assumptions on the design matrix $X$ than optimal algorithms involving least squares over the $\ell_0$-ball.
  • Article
    Asymptotic distribution theory is the primary method used to examine the properties of econometric estimators and tests. We present conditions for obtaining consistency and asymptotic normality of a very general class of estimators (extremum estimators). Consistent asymptotic variance estimators are given to enable approximation of the asymptotic distribution. Asymptotic efficiency is another desirable property then considered. Throughout the chapter, the general results are also specialized to common econometric estimators (e.g. MLE and GMM), and in specific examples we work through the conditions for the various results in detail. The results are also extended to two-step estimators (with finite-dimensional parameter estimation in the first step), estimators derived from nonsmooth objective functions, and semiparametric two-step estimators (with nonparametric estimation of an infinite-dimensional parameter in the first step). Finally, the trinity of test statistics is considered within the quite general setting of GMM estimation, and numerous examples are given.
  • Article
    This paper is about efficient estimation and consistent tests of conditional moment restrictions. We use unconditional moment restrictions based on splines or other approximating functions for this purpose. Empirical likelihood estimation is particularly appropriate for this setting, because of its relatively low bias with many moment conditions. We give conditions so that efficiency of estimators and consistency of tests is achieved as the number of restrictions grows with the sample size. We also give results for generalized empirical likelihood, generalized method of moments, and nonlinear instrumental variable estimators.
  • Article
    In this paper, bounds on asymptotic efficiency are derived for a class of non-parametric models. The data are independent and identically distributed according to some unknown distribution F. There is a given function of the data and a parameter. The restrictions are that a conditional expectation of this function is zero at some point in the parameter space; this point is to be estimated. If F is assumed to be a multinomial distribution with known (finite) support, then the problem becomes parametric and the bound can be obtained from the information matrix. This bound turns out to depend only upon certain conditional moments, and not upon the support of the distribution. Since a general F can be approximated by a multinomial distribution, the multinomial bound applies to the general case.
  • Article
    Using some standard Hilbert space theory a simplified approach to computing efficiency bounds in semiparametric models is presented. We use some interesting examples to illustrate this approach and also obtain some results which seem to be new to the literature.
  • Article
    Full-text available
    Penalized likelihood methods are fundamental to ultra-high dimensional variable selection. How high dimensionality such methods can handle remains largely unknown. In this paper, we show that in the context of generalized linear models, such methods possess model selection consistency with oracle properties even for dimensionality of Non-Polynomial (NP) order of sample size, for a class of penalized likelihood approaches using folded-concave penalty functions, which were introduced to ameliorate the bias problems of convex penalty functions. This fills a long-standing gap in the literature where the dimensionality is allowed to grow slowly with the sample size. Our results are also applicable to penalized likelihood with the $L_1$-penalty, which is a convex function at the boundary of the class of folded-concave penalty functions under consideration. The coordinate optimization is implemented for finding the solution paths, whose performance is evaluated by a few simulation examples and the real data analysis.
  • Article
    We consider learning formulations with non-convex objective functions that often occur in practical applications. There are two approaches to this problem: heuristic methods, such as gradient descent, that only find a local minimum, a drawback being the lack of theoretical guarantees that the local minimum gives a good solution; and convex relaxation, such as L1-regularization, which solves the problem under some conditions but often leads to a sub-optimal solution in practice. This paper tries to remedy the above gap between theory and practice. In particular, we present a multi-stage convex relaxation scheme for solving problems with non-convex objective functions. For learning formulations with sparse regularization, we analyze the behavior of a specific multi-stage relaxation scheme. Under appropriate conditions, we show that the local solution obtained by this procedure is superior to the global solution of the standard L1 convex relaxation for learning sparse targets.
  • Article
    Full-text available
    Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed p setting and in the large p setting as the sample size n gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are "irrepresentable" (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result.
  • Article
    We propose an instrumental variables method for estimation in linear models with endogenous regressors in the high-dimensional setting where the sample size n can be smaller than the number of possible regressors K, with L ≥ K instruments. We allow for heteroscedasticity and we do not need prior knowledge of the variances of the errors. We suggest a new procedure called the STIV (Self Tuning Instrumental Variables) estimator, which is realized as a solution of a conic optimization program. The main results of the paper are upper bounds on the estimation error of the vector of coefficients in $\ell_p$-norms for $1 \le p \le \infty$ that hold with probability close to 1, as well as the corresponding confidence intervals. All results are non-asymptotic. These bounds are meaningful under the assumption that the true structural model is sparse, i.e., the vector of coefficients has few non-zero coordinates (fewer than the sample size n) or many coefficients are too small to matter. In our IV regression setting, the standard tools from the literature on sparsity, such as the restricted eigenvalue assumption, are inapplicable. Therefore, for our analysis we develop a new approach based on data-driven sensitivity characteristics. We show that, under appropriate assumptions, a thresholded STIV estimator correctly selects the non-zero coefficients with probability close to 1. The price to pay for not knowing which coefficients are non-zero and which instruments to use is of the order $\sqrt{\log(L)}$ in the rate of convergence. We extend the procedure to deal with high-dimensional problems where some instruments can be non-valid. We obtain confidence intervals for non-validity indicators and we suggest a procedure which correctly detects the non-valid instruments with probability close to 1.
  • Article
    Full-text available
    Concave regularization methods provide natural procedures for sparse recovery. However, they are difficult to analyze in the high dimensional setting. Only recently a few sparse recovery results have been established for some specific local solutions obtained via specialized numerical procedures. Still, the fundamental relationship between these solutions such as whether they are identical or their relationship to the global minimizer of the underlying nonconvex formulation is unknown. The current paper fills this conceptual gap by presenting a general theoretical framework showing that under appropriate conditions, the global solution of nonconvex regularization leads to desirable recovery performance; moreover, under suitable conditions, the global solution corresponds to the unique sparse local solution, which can be obtained via different numerical procedures. Under this unified framework, we present an overview of existing results and discuss their connections. The unified view of this work leads to a more satisfactory treatment of concave high dimensional sparse estimation procedures, and serves as guideline for developing further numerical procedures for concave regularization.
  • Article
    We study post-model selection estimators which apply ordinary least squares (ols) to the model selected by first-step penalized estimators. It is well known that lasso can estimate the nonparametric regression function at nearly the oracle rate, and is thus hard to improve upon. We show that ols post lasso estimator performs at least as well as lasso in terms of the rate of convergence, and has the advantage of a smaller bias. Remarkably, this performance occurs even if the lasso-based model selection "fails" in the sense of missing some components of the "true" regression model. By the "true" model we mean here the best $s$-dimensional approximation to the nonparametric regression function chosen by the oracle. Furthermore, ols post lasso estimator can perform strictly better than lasso, i.e. a strictly faster rate of convergence, if the lasso-based model selection correctly includes all components of the "true" model as a subset and also achieves sufficient sparsity. In the extreme case, when lasso perfectly selects the "true" model, the ols post lasso estimator becomes the oracle estimator. An important ingredient in our analysis is a new sparsity bound on the dimension of the model selected by lasso which guarantees that this dimension is at most of the same order as the dimension of the "true" model. Moreover, our analysis is not limited to the lasso estimator acting as selector in the first step, but also applies to any other estimator, for example various forms of thresholded lasso, with good rates and good sparsity properties. Our analysis covers both traditional thresholding and a new practical, data-driven thresholding scheme that induces maximal sparsity subject to maintaining a certain goodness-of-fit. The latter scheme has theoretical guarantees similar to those of lasso or ols post lasso, but it dominates these procedures in a wide variety of experiments.
  • Article
    Full-text available
    In high-dimensional model selection problems, penalized least-squares approaches have been extensively used. This paper addresses the question of both robustness and efficiency of penalized model selection methods, and proposes a data-driven weighted linear combination of convex loss functions, together with a weighted $L_1$-penalty. It is completely data-adaptive and does not require prior knowledge of the error distribution. The weighted $L_1$-penalty is used both to ensure the convexity of the penalty term and to ameliorate the bias caused by the $L_1$-penalty. In the setting with dimensionality much larger than the sample size, we establish a strong oracle property of the proposed method that possesses both the model selection consistency and estimation efficiency for the true non-zero coefficients. As specific examples, we introduce a robust composite $L_1$-$L_2$ method and an optimal composite quantile method, and evaluate their performance in both simulated and real data examples.
  • Article
    We consider median regression and, more generally, quantile regression in high-dimensional sparse models. In these models the overall number of regressors p is very large, possibly larger than the sample size n, but only s of these regressors have non-zero impact on the conditional quantile of the response variable, where s grows slower than n. Since in this case the ordinary quantile regression is not consistent, we consider quantile regression penalized by the L1-norm of coefficients (L1-QR). First, we show that L1-QR is consistent at the rate of the square root of (s/n) log p, which is close to the oracle rate of the square root of (s/n), achievable when the minimal true model is known. The overall number of regressors p affects the rate only through the log p factor, thus allowing nearly exponential growth in the number of zero-impact regressors. The rate result holds under relatively weak conditions, requiring that s/n converges to zero at a super-logarithmic speed and that regularization parameter satisfies certain theoretical constraints. Second, we propose a pivotal, data-driven choice of the regularization parameter and show that it satisfies these theoretical constraints. Third, we show that L1-QR correctly selects the true minimal model as a valid submodel, when the non-zero coefficients of the true model are well separated from zero. We also show that the number of non-zero coefficients in L1-QR is of same stochastic order as s, the number of non-zero coefficients in the minimal true model. Fourth, we analyze the rate of convergence of a two-step estimator that applies ordinary quantile regression to the selected model. Fifth, we evaluate the performance of L1-QR in a Monte-Carlo experiment, and provide an application to the analysis of the international economic growth.
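A sketch of the L1-penalized quantile regression (L1-QR) objective described above, using the standard check loss; the normalization by the sample size is an illustrative convention.

```python
# L1-QR objective: mean check loss at quantile tau plus an l1 penalty.
import numpy as np

def l1_qr_objective(beta, X, y, tau, lam):
    u = y - X @ beta
    check = u * (tau - (u < 0).astype(float))   # check (pinball) loss rho_tau(u)
    return np.mean(check) + lam * np.sum(np.abs(beta))
```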
  • Caner, M. and Zhang, H. (2009). General estimating equations: model selection and estimation with diverging number of parameters. Manuscript, North Carolina State University.
  • Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35, 2313-2404.
  • Article
    Full-text available
    Consider the standard linear regression model $y = X\beta^* + w$, where $y \in \mathbb{R}^n$ is an observation vector, $X \in \mathbb{R}^{n \times p}$ is a design matrix, $\beta^* \in \mathbb{R}^p$ is the unknown regression vector, and $w \sim \mathcal{N}(0, \sigma^2 I)$ is additive Gaussian noise. This paper studies the minimax rates of convergence for estimation of $\beta^*$ for $\ell_r$-losses and in the $\ell_2$-prediction loss, assuming that $\beta^*$ belongs to an $\ell_q$-ball $B_q(R_q)$ for some $q \in [0,1]$. We show that under suitable regularity conditions on the design matrix $X$, the minimax error in $\ell_2$-loss and $\ell_2$-prediction loss scales as $R_q (\log p / n)^{1 - q/2}$. In addition, we provide lower bounds on minimax risks in $\ell_r$-norms, for all $r \in [1, +\infty]$, $r \neq q$. Our proofs of the lower bounds are information-theoretic in nature, based on Fano's inequality and results on the metric entropy of the balls $B_q(R_q)$, whereas our proofs of the upper bounds are direct and constructive, involving direct analysis of least squares over $\ell_q$-balls. For the special case $q = 0$, a comparison with $\ell_2$-risks achieved by computationally efficient $\ell_1$-relaxations reveals that although such methods can achieve the minimax rates up to constant factors, they require slightly stronger assumptions on the design matrix $X$ than algorithms involving least squares over the $\ell_0$-ball. Presented in part at the Allerton Conference on Control, Communication and Computing, Monticello, IL, October 2009.
  • Article
    Full-text available
    We consider variable selection in high-dimensional linear models where the number of covariates greatly exceeds the sample size. We introduce the new concept of partial faithfulness and use it to infer associations between the covariates and the response. Under partial faithfulness, we develop a simplified version of the PC algorithm (Spirtes et al., 2000), the PC-simple algorithm, which is computationally feasible even with thousands of covariates and provides consistent variable selection under conditions on the random design matrix that are of a different nature than coherence conditions for penalty-based approaches like the Lasso. Simulations and application to real data show that our method is competitive compared to penalty-based approaches. We provide an efficient implementation of the algorithm in the R-package pcalg.
  • Article
    Model selection and sparse recovery are two important problems for which many regularization methods have been proposed. We study the properties of regularization methods in both problems under the unified framework of regularized least squares with concave penalties. For model selection, we establish conditions under which a regularized least squares estimator enjoys a nonasymptotic property, called the weak oracle property, where the dimensionality can grow exponentially with sample size. For sparse recovery, we present a sufficient condition that ensures the recoverability of the sparsest solution. In particular, we approach both problems by considering a family of penalties that give a smooth homotopy between $L_0$ and $L_1$ penalties. We also propose the sequentially and iteratively reweighted squares (SIRS) algorithm for sparse recovery. Numerical studies support our theoretical results and demonstrate the advantage of our new methods for model selection and sparse recovery. Published in the Annals of Statistics (DOI: 10.1214/09-AOS683).
  • Article
    The empirical distribution function based on a sample is well known to be the maximum likelihood estimate of the distribution from which the sample was taken. In this paper the likelihood function for distributions is used to define a likelihood ratio function for distributions. It is shown that this empirical likelihood ratio function can be used to construct confidence intervals for the sample mean, for a class of M-estimates that includes quantiles, and for differentiable statistical functionals. The results are nonparametric extensions of Wilks's (1938) theorem for parametric likelihood ratios. The intervals are illustrated on some real data and compared in a simulation to some bootstrap confidence intervals and to intervals based on Student's t statistic. A hybrid method that uses the bootstrap to determine critical values of the likelihood ratio is introduced.
  • Article
    Full-text available
    Conditional heteroscedasticity has often been used in modelling and understanding the variability of statistical data. Under a general set-up which includes nonlinear time series models as a special case, we propose an efficient and adaptive method for estimating the conditional variance. The basic idea is to apply a local linear regression to the squared residuals. We demonstrate that, without knowing the regression function, we can estimate the conditional variance asymptotically as well as if the regression were given. This asymptotic result, established under the assumption that the observations are made from a strictly stationary and absolutely regular process, is also verified via simulation. Further, the asymptotic result paves the way for adapting an automatic bandwidth selection scheme. An application with financial data illustrates the usefulness of the proposed techniques.
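A rough sketch of the residual-based idea in the preceding entry: estimate the regression function, square the residuals, and smooth the squared residuals with a local linear fit. The Gaussian kernel, the bandwidth, and the reuse of the same smoother for the mean step are illustrative assumptions.

```python
# Local linear smoothing of squared residuals to estimate a conditional variance.
import numpy as np

def local_linear(x, z, x0, h):
    """Local linear estimate of E[z | x = x0] with a Gaussian kernel."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    D = np.column_stack([np.ones_like(x), x - x0])
    coef = np.linalg.solve(D.T @ (w[:, None] * D), D.T @ (w * z))
    return coef[0]

def cond_variance(x, y, grid, h=0.3):
    m_hat = np.array([local_linear(x, y, x0, h) for x0 in x])    # crude mean fit
    r2 = (y - m_hat) ** 2                                        # squared residuals
    return np.array([local_linear(x, r2, x0, h) for x0 in grid])
```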
  • Article
    Fan & Li (2001) propose a family of variable selection methods via penalized likelihood using concave penalty functions. The nonconcave penalized likelihood estimators enjoy the oracle properties, but maximizing the penalized likelihood function is computationally challenging, because the objective function is nondifferentiable and nonconcave. In this article we propose a new unified algorithm based on the local linear approximation (LLA) for maximizing the penalized likelihood for a broad class of concave penalty functions. Convergence and other theoretical properties of the LLA algorithm are established. A distinguished feature of the LLA algorithm is that at each LLA step, the LLA estimator can naturally adopt a sparse representation. Thus we suggest using the one-step LLA estimator from the LLA algorithm as the final estimates. Statistically, we show that if the regularization parameter is appropriately chosen, the one-step LLA estimates enjoy the oracle properties with good initial estimators. Computationally, the one-step LLA estimation methods dramatically reduce the computational cost in maximizing the nonconcave penalized likelihood. We conduct some Monte Carlo simulation to assess the finite sample performance of the one-step sparse estimation methods. The results are very encouraging.
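A sketch of one local linear approximation (LLA) step for the SCAD penalty described above: the concave penalty is linearized at an initial estimate, so the step reduces to a weighted Lasso with weights given by the SCAD derivative. Only the objective is shown; the solver choice is left open, and the constant a = 3.7 is the conventional default.

```python
# One LLA step: fixed weights p'_lambda(|beta_init|) turn SCAD into a weighted Lasso.
import numpy as np

def scad_derivative(t, lam, a=3.7):
    t = np.abs(t)
    return lam * ((t <= lam) +
                  np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam) * (t > lam))

def lla_step_objective(beta, X, y, beta_init, lam):
    w = scad_derivative(beta_init, lam)                  # weights from the initial estimate
    return 0.5 * np.mean((y - X @ beta) ** 2) + np.sum(w * np.abs(beta))
```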
  • Article
    This paper explores the following question: what kind of statistical guarantees can be given when doing variable selection in high dimensional models? In particular, we look at the error rates and power of some multi-stage regression methods. In the first stage we fit a set of candidate models. In the second stage we select one model by cross-validation. In the third stage we use hypothesis testing to eliminate some variables. We refer to the first two stages as "screening" and the last stage as "cleaning." We consider three screening methods: the lasso, marginal regression, and forward stepwise regression. Our method gives consistent variable selection under certain conditions.
  • Article
    Full-text available
    This paper provides an introduction to alternative models of uncertain commodity prices. A model of commodity price movements is the engine around which any valuation methodology for commodity production projects is built, whether discounted cash flow (DCF) models or the recently developed modern asset pricing (MAP) methods. The accuracy of the valuation is in part dependent on the quality of the engine employed. This paper provides an overview of several basic commodity price models and explains the essential differences among them. We also show how futures prices can be used to discriminate among the models and to estimate better key parameters of the model chosen.
  • Article
    Variable selection plays an important role in high dimensional statistical modelling, which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, accuracy of estimation and computational cost are two top concerns. Recently, Candes and Tao have proposed the Dantzig selector using $L_1$-regularization and showed that it achieves the ideal risk up to a logarithmic factor $\log p$. Their innovative procedure and remarkable result are challenged when the dimensionality is ultrahigh, as the factor $\log p$ can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method that is based on correlation learning, called sure independence screening, to reduce dimensionality from high to a moderate scale that is below the sample size. In a fairly general asymptotic framework, correlation learning is shown to have the sure screening property for even exponentially growing dimensionality. As a methodological extension, iterative sure independence screening is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below sample size, variable selection can be improved on both speed and accuracy, and can then be accomplished by a well-developed method such as smoothly clipped absolute deviation, the Dantzig selector, lasso or adaptive lasso. The connections between these penalized least squares methods are also elucidated.
  • Article
    Semiparametric models are those where the functional form of some components is unknown. Efficiency bounds are of fundamental importance for such models. They provide a guide to estimation methods and give an asymptotic efficiency standard. The purpose of this paper is to provide an introduction to research methods and problems for semiparametric efficiency bounds. The nature of the bounds is discussed, as well as ways of calculating them. Their uses in solving estimation problems are outlined, including construction of semiparametric estimators and calculation of the limiting distribution. The paper includes new results as well as survey material.
  • Article
    This paper considers a generalized method of moments (GMM) estimation problem in which one has a vector of moment conditions, some of which are correct and some incorrect. The paper introduces several procedures for consistently selecting the correct moment conditions. Application of the results of the paper to instrumental variables estimation problems yields consistent procedures for selecting instrumental variables. The paper specifies moment selection criteria that are GMM analogues of the widely used BIC and AIC model selection criteria. (The latter is not consistent.) The paper also considers downward and upward testing procedures.
  • Article
    This paper studies estimators that make sample analogues of population orthogonality conditions close to zero. Strong consistency and asymptotic normality of such estimators is established under the assumption that the observable variables are stationary and ergodic. Since many linear and nonlinear econometric estimators reside within the class of estimators studied in this paper, a convenient summary of the large sample properties of these estimators, including some whose large sample properties have not heretofore been discussed, is provided.
  • Article
    This paper develops consistent model and moment selection criteria for GMM estimation. The criteria select the correct model specification and all correct moment conditions asymptotically. The selection criteria resemble the widely used likelihood-based selection criteria BIC, HQIC, and AIC. (The latter is not consistent.) The GMM selection criteria are based on the J statistic for testing over-identifying restrictions. Bonus terms reward the use of fewer parameters for a given number of moment conditions and the use of more moment conditions for a given number of parameters. The paper also considers a consistent downward testing procedure. The paper applies the model and moment selection criteria to dynamic panel data models with unobserved individual effects. The paper shows how to apply the selection criteria to select the lag length for lagged dependent variables, to detect the number and locations of structural breaks, to determine the exogeneity of regressors, and/or to determine the existence of correlation between some regressors and the individual effect. To illustrate the finite sample performance of the selection criteria and the testing procedures and their impact on parameter estimation, the paper reports the results of a Monte Carlo experiment on a dynamic panel data model.
  • Article
    This paper proposes an asymptotically efficient method for estimating models with conditional moment restrictions. Our estimator generalizes the maximum empirical likelihood estimator (MELE) of Qin and Lawless (1994). Using a kernel smoothing method, we efficiently incorporate the information implied by the conditional moment restrictions into our empirical likelihood-based procedure. This yields a one-step estimator which avoids estimating optimal instruments. Our likelihood ratio-type statistic for parametric restrictions does not require the estimation of variance, and achieves asymptotic pivotalness implicitly. The estimation and testing procedures we propose are normalization invariant. Simulation results suggest that our new estimator works remarkably well in finite samples.
  • Article
    This paper describes a semiparametric estimator for binary response models in which there may be arbitrary heteroskedasticity of unknown form. The estimator is obtained by maximizing a smoothed version of the objective function of C. Manski's maximum score estimator. The smoothing procedure is similar to that used in kernel nonparametric density estimation. The resulting estimator's rate of convergence in probability is the fastest possible under the assumptions that are made. The centered, normalized estimator is asymptotically normally distributed. Methods are given for consistently estimating the parameters of the limiting distribution and for selecting the bandwidth required by the smoothing procedure.
  • Article
    Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of parametric models such as generalized linear models and robust regression models. They can also be applied easily to nonparametric modeling by using wavelets and splines. Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known. Our simulation shows that the newly proposed methods compare favorably with other variable selection techniques. Furthermore, the standard error formulas are tested to be accurate enough for practical applications.
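For reference, the SCAD penalty discussed in the preceding entry has the standard closed form below (with a > 2, commonly a = 3.7); it is included only as a reminder of the penalty's shape.

```latex
p_\lambda(t) =
\begin{cases}
\lambda |t|, & |t| \le \lambda,\\
\dfrac{2a\lambda|t| - t^2 - \lambda^2}{2(a-1)}, & \lambda < |t| \le a\lambda,\\
\dfrac{(a+1)\lambda^2}{2}, & |t| > a\lambda.
\end{cases}
```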
  • Article
    The lasso is a popular technique for simultaneous estimation and variable selection. Lasso variable selection has been shown to be consistent under certain conditions. In this work we derive a necessary condition for the lasso variable selection to be consistent. Consequently, there exist certain scenarios where the lasso is inconsistent for variable selection. We then propose a new version of the lasso, called the adaptive lasso, where adaptive weights are used for penalizing different coefficients in the l1 penalty. We show that the adaptive lasso enjoys the oracle properties; namely, it performs as well as if the true underlying model were given in advance. Similar to the lasso, the adaptive lasso is shown to be near-minimax optimal. Furthermore, the adaptive lasso can be solved by the same efficient algorithm for solving the lasso. We also discuss the extension of the adaptive lasso in generalized linear models and show that the oracle properties still hold under mild regularity conditions. As a byproduct of our theory, the nonnegative garotte is shown to be consistent for variable selection.
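A sketch of the adaptive lasso described above, using the usual column-rescaling reduction to an ordinary Lasso; the ridge initial estimate, gamma = 1, and the small eps guard are illustrative assumptions.

```python
# Adaptive lasso via column rescaling: weights 1/|beta_init|^gamma.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def adaptive_lasso(X, y, alpha=0.1, gamma=1.0, eps=1e-6):
    beta_init = Ridge(alpha=1.0).fit(X, y).coef_        # initial estimate
    w = 1.0 / (np.abs(beta_init) ** gamma + eps)        # adaptive weights
    fit = Lasso(alpha=alpha).fit(X / w, y)              # weighted l1 via rescaled design
    return fit.coef_ / w                                # map back to the original scale
```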
  • Article
    Full-text available
    Finding a sparse representation of signals is desired in many applications. For a representation dictionary D and a given signal S ∈ span{D}, we are interested in finding the sparsest vector α such that Dα = S. Previous results have shown that if D is composed of a pair of unitary matrices, then under some restrictions dictated by the nature of the matrices involved, one can find the sparsest representation using an $\ell_1$ minimization rather than the $\ell_0$ norm of the required composition. Obviously, such a result is highly desired since it leads to a convex linear programming form. In this paper we extend previous results and prove a similar relationship for the most general dictionary D. We also show that previous results emerge as special cases of the new extended theory. In addition, we show that the above results can be markedly improved if an ensemble of such signals is given, and higher order moments are used.
  • Article
    Full-text available
    Variable selection plays an important role in high dimensional statistical modeling which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality $p$, estimation accuracy and computational cost are two top concerns. In a recent paper, Candes and Tao (2007) propose the Dantzig selector using $L_1$ regularization and show that it achieves the ideal risk up to a logarithmic factor $\log p$. Their innovative procedure and remarkable result are challenged when the dimensionality is ultra high as the factor $\log p$ can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method based on a correlation learning, called the Sure Independence Screening (SIS), to reduce dimensionality from high to a moderate scale that is below sample size. In a fairly general asymptotic framework, the correlation learning is shown to have the sure screening property for even exponentially growing dimensionality. As a methodological extension, an iterative SIS (ISIS) is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below sample size, variable selection can be improved on both speed and accuracy, and can then be accomplished by a well-developed method such as the SCAD, Dantzig selector, Lasso, or adaptive Lasso. The connections of these penalized least-squares methods are also elucidated.
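A sketch of the correlation screening step of Sure Independence Screening (SIS) described above: rank predictors by absolute marginal correlation with the response and keep the top d, with d below the sample size. The default d = n / log(n) is one convention from the SIS literature and is treated here as an assumption.

```python
# SIS: keep the d predictors most correlated (marginally) with the response.
import numpy as np

def sis(X, y, d=None):
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))
    Xc = (X - X.mean(0)) / X.std(0)
    yc = (y - y.mean()) / y.std()
    corr = np.abs(Xc.T @ yc) / n          # absolute marginal correlations
    return np.argsort(corr)[::-1][:d]     # indices of retained predictors
```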
  • Article
    We consider linear inverse problems where the solution is assumed to have a sparse expansion on an arbitrary pre-assigned orthonormal basis. We prove that replacing the usual quadratic regularizing penalties by weighted $\ell^p$-penalties on the coefficients of such expansions, with $1 \le p \le 2$, still regularizes the problem. If $p < 2$, regularized solutions of such $\ell^p$-penalized problems will have sparser expansions with respect to the basis under consideration. To compute the corresponding regularized solutions we propose an iterative algorithm that amounts to a Landweber iteration with thresholding (or nonlinear shrinkage) applied at each iteration step. We prove that this algorithm converges in norm. We also review some potential applications of this method.
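A sketch of the thresholded Landweber iteration described above, specialized to the $\ell^1$ (p = 1) case: a gradient step on the quadratic data-fit term followed by soft thresholding. The step size and iteration count are illustrative choices.

```python
# Iterative soft thresholding for 0.5*||y - X beta||^2 + lam*||beta||_1.
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    L = np.linalg.norm(X, 2) ** 2             # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```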
  • Article
    We study the asymptotic properties of bridge estimators in sparse, high-dimensional, linear regression models when the number of covariates may increase to infinity with the sample size. We are particularly interested in the use of bridge estimators to distinguish between covariates whose coefficients are zero and covariates whose coefficients are nonzero. We show that under appropriate conditions, bridge estimators correctly select covariates with nonzero coefficients with probability converging to one and that the estimators of nonzero coefficients have the same asymptotic distribution that they would have if the zero coefficients were known in advance. Thus, bridge estimators have an oracle property in the sense of Fan and Li [J. Amer. Statist. Assoc. 96 (2001) 1348--1360] and Fan and Peng [Ann. Statist. 32 (2004) 928--961]. In general, the oracle property holds only if the number of covariates is smaller than the sample size. However, under a partial orthogonality condition in which the covariates of the zero coefficients are uncorrelated or weakly correlated with the covariates of nonzero coefficients, we show that marginal bridge estimators can correctly distinguish between covariates with nonzero and zero coefficients with probability converging to one even when the number of covariates is greater than the sample size.
  • Article
    Full-text available
    We derive the $l_{\infty}$ convergence rate simultaneously for Lasso and Dantzig estimators in a high-dimensional linear regression model under a mutual coherence assumption on the Gram matrix of the design and two different assumptions on the noise: Gaussian noise and general noise with finite variance. Then we prove that simultaneously the thresholded Lasso and Dantzig estimators with a proper choice of the threshold enjoy a sign concentration property provided that the non-zero components of the target vector are not too small.
  • Article
    Full-text available
    Meinshausen and Buhlmann [Ann. Statist. 34 (2006) 1436--1462] showed that, for neighborhood selection in Gaussian graphical models, under a neighborhood stability condition, the LASSO is consistent, even when the number of variables is of greater order than the sample size. Zhao and Yu [(2006) J. Machine Learning Research 7 2541--2567] formalized the neighborhood stability condition in the context of linear regression as a strong irrepresentable condition. That paper showed that under this condition, the LASSO selects exactly the set of nonzero regression coefficients, provided that these coefficients are bounded away from zero at a certain rate. In this paper, the regression coefficients outside an ideal model are assumed to be small, but not necessarily zero. Under a sparse Riesz condition on the correlation of design variables, we prove that the LASSO selects a model of the correct order of dimensionality, controls the bias of the selected model at a level determined by the contributions of small regression coefficients and threshold bias, and selects all coefficients of greater order than the bias of the selected model. Moreover, as a consequence of this rate consistency of the LASSO in model selection, it is proved that the sum of error squares for the mean response and the $\ell_{\alpha}$-loss for the regression coefficients converge at the best possible rates under the given conditions. An interesting aspect of our results is that the logarithm of the number of variables can be of the same order as the sample size for certain random dependent designs.