Regularity properties such as the incoherence condition, the restricted isometry property, compatibility, restricted eigenvalue and ℓq sensitivity of covariate matrices play a pivotal role in high-dimensional regression and compressed sensing. Yet, as with computing the spark of a matrix, we first show that it is NP-hard to check these conditions, which involve all submatrices of a given size.
In event studies of capital market efficiency, an earnings surprise has historically been measured by the consensus error, defined as earnings minus the consensus, or average, of professional forecasts. The rationale is that the consensus is an accurate measure of the market's expectation of earnings. But since forecasts can be biased due to conflicts of interest and some investors can see through these conflicts, this rationale is flawed and the consensus error is a biased measure of an earnings surprise. We show that the fraction of forecasts that miss on the same side (FOM), by ignoring the size of the misses, is less sensitive to such bias and is a better measure of an earnings surprise. As a result, FOM outperforms the consensus error and its related robust statistics in explaining stock price movements around and subsequent to the announcement date.
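To make the contrast concrete, here is a minimal sketch (not taken from the paper) comparing the consensus error with one plausible signed version of FOM for a single hypothetical announcement; the forecast and earnings numbers are invented for illustration.

```python
import numpy as np

forecasts = np.array([1.02, 1.05, 1.10, 0.98, 1.07])  # hypothetical analyst EPS forecasts
actual = 1.08                                          # hypothetical realized EPS

# Consensus error: actual earnings minus the average (consensus) forecast.
consensus_error = actual - forecasts.mean()

# FOM ignores the size of the misses and only counts which side each forecast
# missed on; the signed version below is one plausible convention (positive
# when most analysts under-forecast earnings).
misses = actual - forecasts
fom = (np.sum(misses > 0) - np.sum(misses < 0)) / len(forecasts)

print(f"consensus error = {consensus_error:+.3f}, FOM = {fom:+.2f}")
```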
Motivated by the problem of colocalization analysis in fluorescence microscopic imaging, we study in this paper structured detection of correlated regions between two random processes observed on a common domain. We argue that although intuitive, direct use of the maximum log-likelihood statistic suffers from potential bias and substantially reduced power, and introduce a simple size-based normalization to overcome this problem. We show that scanning with the proposed size-corrected likelihood ratio statistics leads to optimal correlation detection over a large collection of structured correlation detection problems.
We propose a bootstrap-based robust high-confidence level upper bound (Robust H-CLUB) for assessing the risks of large portfolios. The proposed approach exploits rank-based and quantile-based estimators, and can be viewed as a robust extension of the H-CLUB procedure (Fan et al., 2015). Such an extension allows us to handle possibly misspecified models and heavy-tailed data, which are stylized features in financial returns. Under mixing conditions, we analyze the proposed approach and demonstrate its advantage over H-CLUB. We further provide thorough numerical results to back up the developed theory, and also apply the proposed method to analyze a stock market dataset.
Several novel large volatility matrix estimation methods have been developed based on high-frequency financial data. They often employ the approximate factor model that leads to a low-rank plus sparse structure for the integrated volatility matrix and facilitates estimation of large volatility matrices. However, for predicting future volatility matrices, these nonparametric estimators do not have a dynamic structure to implement. In this paper, we introduce a novel Itô diffusion process based on the approximate factor models and call it a factor GARCH-Itô model. We then investigate its properties and propose a quasi-maximum likelihood estimation method for the parameters of the factor GARCH-Itô model. We also apply it to estimating conditional expected large volatility matrices and establish their asymptotic properties. Simulation studies are conducted to validate the finite sample performance of the proposed estimators.
Two measures of sensitivity to initial conditions in nonlinear time series are proposed. The notions give some insight into the relationship between the Fisher information in statistical estimation and initial-value sensitivity in dynamical systems. Using local polynomial regression, we develop nonparametric estimates for a conditional density function, its square root and its partial derivatives. The proposed procedures are innovative and of interest in their own right. They are also used to estimate the sensitivity measures. The asymptotic normality of the proposed estimators is established. We also propose a simple and intuitively appealing method for choosing the bandwidths. Two simulated examples are used as illustrations.
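As a point of reference (not the local polynomial estimator analyzed in the paper), a plain double-kernel, Nadaraya-Watson style estimate of a conditional density can be written in a few lines; the function names, bandwidths, and the toy autoregressive series below are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def cond_density(x0, y0, x, y, hx, hy):
    """Kernel estimate of f(y0 | x0) from samples (x, y) with bandwidths hx, hy."""
    wx = gauss_kernel((x - x0) / hx)          # weights in the conditioning variable
    wy = gauss_kernel((y - y0) / hy) / hy     # density-scaled kernel in y
    return np.sum(wx * wy) / np.sum(wx)

# Toy nonlinear autoregressive series y_t = sin(y_{t-1}) + noise.
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = np.sin(y[t - 1]) + 0.3 * rng.standard_normal()
x_lag, y_cur = y[:-1], y[1:]
print(cond_density(0.5, np.sin(0.5), x_lag, y_cur, hx=0.2, hy=0.2))
```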
Value at Risk (VaR) is a fundamental tool for managing market risks. It measures the worst loss to be expected of a portfolio over a given time horizon under normal market conditions at a given confidence level. Calculation of VaR frequently involves estimating the volatility of return processes and quantiles of standardized returns. In this paper, several semiparametric techniques are introduced to estimate the volatilities of the market prices of a portfolio. In addition, both parametric and nonparametric techniques are proposed to estimate the quantiles of standardized return processes. The newly proposed techniques also have the flexibility to adapt automatically to changes in the dynamics of market prices over time. Their statistical efficiencies are studied both theoretically and empirically. The combination of the newly proposed techniques for estimating volatility and standardized quantiles yields several new techniques for forecasting multiple period VaR. The performance of the newly proposed VaR estimators is evaluated and compared with some of the existing methods. Our simulation results and empirical studies endorse the newly proposed time-dependent semiparametric approach for estimating VaR.
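The volatility-times-quantile construction of VaR can be illustrated with a short sketch; the EWMA filter below is a simple stand-in for the paper's semiparametric volatility estimator, and the simulated heavy-tailed returns and parameter values are purely illustrative.

```python
import numpy as np

def one_period_var(returns, alpha=0.05, lam=0.94):
    """One-period-ahead Value at Risk of a long position, reported as a positive loss."""
    sigma2 = np.var(returns[:20])                  # crude initialization of the variance
    std_resid = []
    for r in returns:
        std_resid.append(r / np.sqrt(sigma2))      # standardize by the current volatility forecast
        sigma2 = lam * sigma2 + (1 - lam) * r**2   # EWMA volatility update
    q_alpha = np.quantile(std_resid, alpha)        # empirical alpha-quantile of standardized returns
    return -np.sqrt(sigma2) * q_alpha              # VaR forecast for the next period

rng = np.random.default_rng(1)
rets = 0.01 * rng.standard_t(df=5, size=1000)      # simulated heavy-tailed daily returns
print(f"95% one-day VaR estimate: {one_period_var(rets):.4f}")
```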
Covariance matrix estimation is fundamental for almost all areas of multivariate analysis and many other applied problems. In particular, covariance matrices and their inverses play a central role in risk management and portfolio allocation. For example, the smallest and largest eigenvalues are related to the minimum and maximum variances of the selected portfolio, respectively, and the eigenvectors are related to portfolio allocation. Therefore, we need a good covariance matrix estimator that is well-conditioned, i.e., inverting it does not excessively amplify the estimation error. See Goldfarb and Iyengar (2003) for applications of covariance matrices to portfolio selection and Johnstone (2001) for their statistical implications. Estimating large dimensional covariance matrices is intrinsically challenging. For example, in portfolio allocation and risk management, the number p of stocks can be very large.
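One way to see why conditioning matters: the global minimum variance portfolio uses the inverse covariance matrix directly, so estimation error in an ill-conditioned estimate is amplified when it is inverted. The sketch below (simulated returns, hypothetical dimensions) just illustrates the standard formula w = Sigma^{-1} 1 / (1' Sigma^{-1} 1); it is not drawn from the paper.

```python
import numpy as np

def min_variance_weights(sigma):
    """Global minimum variance portfolio weights for covariance matrix sigma."""
    ones = np.ones(sigma.shape[0])
    w = np.linalg.solve(sigma, ones)   # Sigma^{-1} 1 without forming the inverse explicitly
    return w / w.sum()

rng = np.random.default_rng(2)
p, n = 50, 200
returns = 0.01 * rng.standard_normal((n, p))       # hypothetical daily returns
sigma_hat = np.cov(returns, rowvar=False)          # sample covariance estimate
w = min_variance_weights(sigma_hat)
print("in-sample portfolio variance:", float(w @ sigma_hat @ w))
```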
Prediction error is critical to assessing the performance of statistical methods and selecting statistical models. We propose cross-validation and approximated cross-validation methods for estimating prediction error under a broad q-class of Bregman divergences for error measures, which embeds nearly all of the commonly used loss functions in the regression, classification and machine learning literature. The approximated cross-validation formulas are analytically derived, which facilitates fast estimation of prediction error under the Bregman divergence. We then study a data-driven optimal bandwidth selector for local-likelihood estimation that minimizes the overall prediction error or, equivalently, the covariance penalty. It is shown that the covariance penalty and cross-validation methods converge to the same mean prediction error criterion. We also propose a lower-bound scheme for computing the local logistic regression estimates and demonstrate that it is as simple and stable as local least-squares regression estimation. The algorithm monotonically increases the target local likelihood and converges. The idea and methods are extended to generalized varying-coefficient models and semiparametric models.
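For readers unfamiliar with Bregman divergences, the sketch below shows the generic form D_phi(y, mu) = phi(y) - phi(mu) - (y - mu) phi'(mu) and checks that a quadratic generating function recovers squared error loss; the specific q-class of generating functions studied in the paper is not reproduced here, and prediction error under any chosen divergence can then be estimated by averaging the divergence between held-out responses and cross-validated fits.

```python
import numpy as np

def bregman(y, mu, phi, dphi):
    """Bregman divergence D_phi(y, mu) = phi(y) - phi(mu) - (y - mu) * phi'(mu)."""
    return phi(y) - phi(mu) - (y - mu) * dphi(mu)

y, mu = 1.3, 0.8
sq = bregman(y, mu, phi=lambda u: u**2, dphi=lambda u: 2 * u)
print(np.isclose(sq, (y - mu) ** 2))   # True: squared error loss is a Bregman divergence
```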
There is growing interest in understanding statistical inference under possibly non-sparse high-dimensional models. For a given component of the regression coefficient, we show that the difficulty of the problem depends on the sparsity of the corresponding row of the precision matrix of the covariates, not the sparsity of the regression coefficients. We develop new concepts of uniform and essentially uniform non-testability that allow the study of limitations of tests across a broad set of alternatives. Uniform non-testability identifies a collection of alternatives such that the power of any test, against any alternative in the group, is asymptotically at most equal to the nominal size of the test. Implications of the new constructions include new minimax testability results that, in sharp contrast to the existing results, do not depend on the sparsity of the regression parameters. We identify new tradeoffs between testability and feature correlation. In particular, we show that in models with weak feature correlations the minimax lower bound can be attained by a test whose power has the parametric rate regardless of the size of the model sparsity.
Consider a normal model with unknown mean bounded by a known constant. This paper deals with minimax estimation of the squared mean. We establish an expression for the asymptotic minimax risk. This result is applied in nonparametric estimation of quadratic functionals.
Variable selection is vital to statistical data analyses. Many of the procedures in use are ad hoc stepwise selection procedures, which are computationally expensive and ignore stochastic errors in the variable selection process of previous steps. An automatic and simultaneous variable selection procedure can be obtained by using a penalized likelihood method. In traditional linear models, the best subset selection and stepwise deletion methods coincide with a penalized least-squares method when design matrices are orthonormal. In this paper, we propose a few new approaches to selecting variables for linear models, robust regression models and generalized linear models based on a penalized likelihood approach. A family of thresholding functions is proposed. The LASSO proposed by Tibshirani (1996) is a member of the penalized least-squares family with the ℓ1-penalty. A smoothly clipped absolute deviation (SCAD) penalty function is introduced to ameliorate the properties of the ℓ1-penalty. A unified algorithm is introduced, which is backed up by statistical theory. The new approaches are compared with the ordinary least-squares method, the garrote method by Breiman (1995) and the LASSO method by Tibshirani (1996). Our simulation results show that the newly proposed methods compare favorably with other approaches as an automatic variable selection technique. Because of simultaneous selection of variables and estimation of parameters, we are able to give a simple estimated standard error formula, which is tested to be accurate enough for practical applications. Two real data examples illustrate the versatility and effectiveness of the proposed approaches.
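As a concrete illustration of the thresholding functions mentioned above, the sketch below implements soft thresholding (the ℓ1/LASSO rule) and the SCAD thresholding rule that arises from penalized least squares with an orthonormal design, using the conventional choice a = 3.7 (Fan and Li, 2001); this is a minimal sketch, not the paper's unified algorithm.

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO (L1) thresholding rule."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule for the least-squares estimate z with an orthonormal design."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    return np.where(
        absz <= 2 * lam,
        soft_threshold(z, lam),                                   # soft shrinkage near zero
        np.where(
            absz <= a * lam,
            ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),       # linear transition region
            z,                                                    # no shrinkage for large signals
        ),
    )

z = np.linspace(-6, 6, 7)
print(scad_threshold(z, lam=1.0))   # large inputs are left untouched, small ones shrunk to zero
```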
When people in a society want to make inference about some parameter, each person may want to use data collected by other people. Information (data) exchange in social networks is usually costly, so to make reliable statistical decisions, people need to weigh the benefits and costs of information acquisition. Conflicts of interest and coordination problems arise in the process. Classical statistics does not consider people's incentives and interactions in the data-collection process. To address this imperfection, this work explores multi-agent Bayesian inference problems with a game-theoretic social network model. Motivated by our interest in aggregate inference at the societal level, we propose a new concept, finite population learning, to address whether, with high probability, a large fraction of people in a given finite population network can make good inference. Serving as a foundation, this concept enables
Statistical and machine learning theory has developed several conditions ensuring that popular estimators such as the Lasso or the Dantzig selector perform well in high-dimensional sparse regression, including the restricted eigenvalue, compatibility, and ℓq sensitivity properties. However, some of the central aspects of these conditions are not well understood. For instance, it is unknown whether these conditions can be checked efficiently on any given dataset. This is problematic, because they are at the core of the theory of sparse regression. Here we provide a rigorous proof that these conditions are NP-hard to check. This shows that the conditions are computationally infeasible to verify, and raises some questions about their practical applications. However, by taking an average-case perspective instead of the worst-case view of NP-hardness, we show that a particular condition, ℓq sensitivity, has certain
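To see why naive verification is hopeless, the sketch below computes the smallest s-sparse eigenvalue of the Gram matrix by brute force over all column subsets; the number of subsets grows combinatorially in p and s, which is the practical face of the hardness being discussed (the dimensions here are small, made-up values, and this brute force is illustrative, not a proposed algorithm).

```python
from itertools import combinations
import numpy as np

def min_sparse_eigenvalue(X, s):
    """Smallest eigenvalue of X_S' X_S / n over all column subsets S of size s."""
    n, p = X.shape
    best = np.inf
    for S in combinations(range(p), s):          # (p choose s) subsets to examine
        cols = list(S)
        G = X[:, cols].T @ X[:, cols] / n
        best = min(best, np.linalg.eigvalsh(G)[0])
    return best

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 12))
print(min_sparse_eigenvalue(X, s=3))   # already 220 subsets for p = 12, s = 3
```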
Heterogeneity is an unwanted variation when analyzing aggregated datasets from multiple sources. Though different methods have been proposed for heterogeneity adjustment, no systematic theory exists to justify these methods. In this work, we propose a generic framework named ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) to model, estimate, and adjust heterogeneity from the original data. Once the heterogeneity is adjusted, we are able to remove the batch effects and to enhance the inferential power by aggregating the homogeneous residuals from multiple sources. Under a pervasive assumption that the latent heterogeneity factors simultaneously affect a fraction of observed variables, we provide a rigorous theory to justify the proposed framework. Our framework also allows the incorporation of informative covariates and appeals to the Blessing of Dimensionality. As an illustrative
Many statistical models seek relationships among variables via subspaces of reduced dimension. For instance, in factor models, variables are roughly distributed around a low-dimensional subspace determined by the loading matrix; in mixed linear regression models, the coefficient vectors for different mixtures form a subspace that captures all regression functions; in multiple-index models, the effect of covariates is summarized by the effective dimension reduction space.
We prove a sharp Bernstein inequality for general-state-space, not necessarily reversible Markov chains. It is sharp in the sense that the variance proxy term is optimal. Our result covers the classical Bernstein inequality for independent random variables as a special case.
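For reference, the classical special case recovered by such results is the Bernstein inequality for independent, centered, bounded random variables, which can be stated as follows (standard textbook form, not a restatement of the paper's Markov chain bound).

```latex
% Classical Bernstein inequality: X_1, ..., X_n independent, E[X_i] = 0, |X_i| <= b almost surely.
\[
  \mathbb{P}\!\left(\sum_{i=1}^{n} X_i \ge t\right)
  \le \exp\!\left(-\frac{t^{2}}{2\bigl(\sigma^{2} + bt/3\bigr)}\right),
  \qquad \sigma^{2} = \sum_{i=1}^{n} \mathbb{E}\bigl[X_i^{2}\bigr], \quad t \ge 0 .
\]
```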
Value at Risk (VaR) measures the worst loss to be expected of a portfolio over a given time horizon at a given confidence level. Calculation of VaR frequently involves estimating the volatility of return processes and quantiles of standardized returns. In this paper, several semiparametric techniques are introduced to estimate the volatilities. In addition, both parametric and nonparametric techniques are proposed to estimate the quantiles of standardized return processes. The newly proposed techniques also have the flexibility to adapt automatically to changes in the dynamics of market prices over time. The combination of the newly proposed techniques for estimating volatility and standardized quantiles yields several new techniques for evaluating multiple period VaR. The performance of the newly proposed VaR estimators is evaluated and compared with some of the existing methods. Our simulation results and empirical studies endorse the newly proposed time-dependent semiparametric approach for estimating VaR.
Error variance estimation plays an important role in statistical inference for high-dimensional regression models. This article is concerned with error variance estimation in the high-dimensional sparse additive model. We study the asymptotic behavior of the traditional mean squared error, the naive estimate of the error variance, and show that it may significantly underestimate the error variance due to spurious correlations, which are even more severe in nonparametric models than in linear models. We further propose an accurate estimate of the error variance in the ultrahigh-dimensional sparse additive model by effectively integrating sure independence screening and refitted cross-validation techniques. The root-n consistency and the asymptotic normality of the resulting estimate are established. We conduct a Monte Carlo simulation study to examine the finite sample performance of the newly proposed estimate. A real data example is used to illustrate the proposed method.
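The refitted cross-validation idea can be sketched for a linear working model as follows: select variables on one half of the data, refit by least squares on the other half, estimate the error variance from the refitted residuals, then swap the halves and average. The lasso selector and simulated data below are illustrative stand-ins; the paper combines sure independence screening with an additive (nonparametric) fit rather than the linear lasso used here.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def rcv_sigma2(X, y, seed=0):
    """Refitted cross-validation estimate of the error variance (linear sketch)."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    halves = (idx[: n // 2], idx[n // 2:])
    est = []
    for fit_half, select_half in ((0, 1), (1, 0)):
        # Step 1: variable selection on one half of the data.
        sel = LassoCV(cv=5).fit(X[halves[select_half]], y[halves[select_half]])
        keep = np.flatnonzero(sel.coef_)
        if keep.size == 0:
            est.append(np.var(y[halves[fit_half]], ddof=1))
            continue
        # Step 2: refit on the other half using only the selected variables.
        Xf, yf = X[halves[fit_half]][:, keep], y[halves[fit_half]]
        resid = yf - LinearRegression().fit(Xf, yf).predict(Xf)
        est.append(np.sum(resid**2) / (len(yf) - keep.size - 1))
    return float(np.mean(est))

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 50))
y = 2.0 * X[:, 0] + rng.standard_normal(200)   # true error variance is 1
print(rcv_sigma2(X, y))
```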
This paper studies model selection consistency for high-dimensional sparse regression when the data exhibit both cross-sectional and serial dependence. Most commonly used model selection methods fail to consistently recover the true model when the covariates are highly correlated. Motivated by econometric studies, we consider the case where covariate dependence can be reduced through a factor model, and propose a consistent strategy named Factor-Adjusted Regularized Model Selection (FarmSelect). By separating the latent factors from idiosyncratic components, we transform the problem from model selection with highly correlated covariates to one with weakly correlated variables. Model selection consistency as well as optimal rates of convergence are obtained under mild conditions. Numerical studies demonstrate good finite sample performance in terms of both model selection and out-of-sample prediction. Moreover, our method is flexible in the sense that it pays no price in weakly correlated and uncorrelated cases. Our method is applicable to a wide range of high-dimensional sparse regression problems. An R package, FarmSelect, is also provided for implementation.
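A simplified sketch of the factor-adjustment step (not the FarmSelect package's exact procedure): estimate factor proxies by principal components, project them out of both the covariates and the response, and run the lasso on the weakly correlated idiosyncratic components. The number of factors, the function name, and the simulated data below are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def farm_style_select(X, y, n_factors=3):
    """Factor-adjusted lasso selection sketch: PCA factor removal, then lasso."""
    Xc = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    F = U[:, :n_factors]                     # estimated factor space (up to rotation)
    X_idio = Xc - F @ (F.T @ Xc)             # idiosyncratic components of the covariates
    yc = y - y.mean()
    y_idio = yc - F @ (F.T @ yc)             # response with factor effects projected out
    fit = LassoCV(cv=5).fit(X_idio, y_idio)
    return np.flatnonzero(fit.coef_)         # indices of selected covariates

rng = np.random.default_rng(5)
n, p = 200, 100
f = rng.standard_normal((n, 2))                              # latent factors
X = f @ rng.standard_normal((2, p)) + rng.standard_normal((n, p))
y = X[:, :3] @ np.array([3.0, -2.0, 1.5]) + rng.standard_normal(n)
print(farm_style_select(X, y))
```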
Large-scale multiple testing with correlated and heavy-tailed data arises in a wide range of research areas, from genomics and medical imaging to finance. Conventional methods for estimating the false discovery proportion (FDP) often ignore the effect of heavy-tailedness and the dependence structure among test statistics, and thus may lead to inefficient or even inconsistent estimation. Also, the assumption of joint normality is often imposed, which is too stringent for many applications. To address these challenges, in this paper we propose a factor-adjusted robust procedure for large-scale simultaneous inference with control of the false discovery proportion. We demonstrate that robust factor adjustments are extremely important in both improving the power of the tests and controlling the FDP. We identify general conditions under which the proposed method produces a consistent estimate of the FDP. As a byproduct of independent interest, we establish an exponential-type deviation inequality for a robust U-type covariance estimator under the spectral norm. Extensive numerical experiments demonstrate the advantage of the proposed method over several state-of-the-art methods, especially when the data are generated from heavy-tailed distributions. Our proposed procedures are implemented in the R package FarmTest.
Most papers on high-dimensional statistics are based on the assumption that none of the regressors are correlated with the regression error, namely, that they are exogenous. Yet, endogeneity arises easily in high-dimensional regression due to a large pool of regressors, and this causes the inconsistency of penalized least-squares methods and possible false scientific discoveries. A necessary condition for model selection consistency of a very general class of penalized regression methods is given, which allows us to prove the inconsistency claim formally. To cope with the possible endogeneity, we construct a novel penalized focused generalized method of moments (FGMM) criterion function and offer a new optimization algorithm. The FGMM is not a smooth function. To establish its asymptotic properties, we first study the model selection consistency and an oracle property for a general class of penalized regression methods. These results are then used to show that the FGMM possesses an oracle property even in the presence of endogenous predictors, and that the solution is near the global minimum under the over-identification assumption. Finally, we also show how the semiparametric efficiency of the estimation can be achieved via a two-step approach.
Measuring conditional dependence is an important topic in statistics, with broad applications including graphical models. Under a factor model setting, a new conditional dependence measure based on projection is proposed. The corresponding conditional independence test is developed, and its asymptotic null distribution is unveiled in settings where the number of factors can be large. It is also shown that the new test controls the asymptotic significance level and can be calculated efficiently. A generic method for building dependency graphs without the Gaussian assumption using the new test is elaborated. Numerical results and real data analysis show the superiority of the new method.