
# Projected Principal Component Analysis in Factor Models

**Article**

*in* SSRN Electronic Journal 44(1) · June 2014


DOI: 10.2139/ssrn.2450770 · Source: arXiv

## Abstract

This paper introduces Projected Principal Component Analysis (Projected-PCA), which performs principal component analysis after projecting the data matrix onto a given linear space. When applied to high-dimensional factor analysis, the projection removes idiosyncratic noise components. We show that the unobserved latent factors can be estimated more accurately than with conventional PCA if the projection is genuine, or more precisely if the factor loading matrices are related to the projected linear space, and that they can be estimated accurately when the dimensionality is large, even when the sample size is finite. To estimate the factor loadings more accurately, we propose a flexible semi-parametric factor model, which decomposes the factor loading matrix into a component that can be explained by subject-specific covariates and an orthogonal residual component. The effect of the covariates on the factor loadings is further modeled by an additive model via sieve approximations. Using the newly proposed Projected-PCA, we obtain rates of convergence for the smooth factor loading matrices that are much faster than those of conventional factor analysis. The convergence holds even when the sample size is finite and is particularly appealing in the high-dimension-low-sample-size situation. This leads us to develop nonparametric tests of whether the observed covariates have explanatory power for the loadings and whether they fully explain the loadings. Finally, the proposed method is illustrated on both simulated data and the returns of the constituents of the S&P 500 index.
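The projection-then-PCA idea is easy to sketch numerically. Below is a minimal NumPy illustration, in which the model dimensions, the smooth loading functions, and the sieve basis (additive polynomials of the covariates) are all illustrative choices, not the paper's actual specification:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, d, K = 200, 50, 2, 3                           # dimension, sample size, covariates, factors

# Semi-parametric factor model: loadings are smooth functions of the covariates
X = rng.uniform(-1, 1, (p, d))                       # subject-specific covariates
B = np.column_stack([np.sin(np.pi * X[:, 0]),        # smooth loading functions
                     np.sin(np.pi * X[:, 1]),
                     X[:, 0] ** 2 - 1.0 / 3.0])      # p x K loading matrix
F = rng.standard_normal((n, K))                      # latent factors
U = rng.standard_normal((p, n))                      # idiosyncratic noise
Y = B @ F.T + U                                      # observed data, p x n

# Sieve basis: additive polynomials of the covariates (an illustrative choice)
Phi = np.column_stack([np.ones(p)] +
                      [X[:, j] ** k for j in range(d) for k in range(1, 6)])
P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)        # projection onto the sieve space

# Projected-PCA: run PCA on the projected data P @ Y instead of Y
PY = P @ Y
evals, evecs = np.linalg.eigh(PY.T @ PY / p)         # n x n eigenproblem
F_hat = np.sqrt(n) * evecs[:, ::-1][:, :K]           # estimated factors (F'F/n = I)
B_hat = PY @ F_hat / n                               # estimated smooth loadings
```

Because the projection keeps only the part of the data lying in the low-dimensional sieve space, most of the idiosyncratic noise U is removed before the eigen-decomposition, which is the source of the accuracy gain described above.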

- ... The problem of estimating a covariance matrix and its inverse is fundamental in many areas of statistics, including principal component analysis (PCA), linear discriminant analysis for classification, and undirected graphical models, to name a few. The intense research in high-dimensional statistics has contributed a stream of papers on covariance matrix estimation, including sparse principal component analysis (Johnstone and Lu, 2009; Amini and Wainwright, 2008; Vu and Lei, 2012; Birnbaum et al., 2013; Berthet and Rigollet, 2013; Ma, 2013; Cai et al., 2013), sparse covariance estimation (Bickel and Levina, 2008; Cai and Liu, 2011; Cai et al., 2010; Lam and Fan, 2009; Ravikumar et al., 2011) and factor model analysis (Stock and Watson, 2002; Bai, 2003; Fan et al., 2008, 2013, 2014; Onatski, 2012). A strong interest in precision matrix estimation (undirected graphical models) has also emerged in the statistics community following the pioneering works of Meinshausen and Bühlmann (2006) and Friedman et al. (2008). ...... A motivating example from economic and financial studies is the classical Fama-French model, where the y_it represent excess returns of stocks in the market and the f_t are interpreted as common factors driving the market. It is more natural to allow for weak temporal dependence such as α-mixing, as in the work of Fan et al. (2014). Though possible, we assume independence in this paper for the sake of simplicity of analysis. ...... The dataset we used in our analysis consists of daily returns of 393 stocks, all of which are large market capitalization constituents of the S&P 500 index, collected without missing values from 2005 to 2013. This dataset has also been used in Fan et al. (2014), where they investigated how covariates (e.g. size, volume) could be utilized to help estimate factors and factor loadings, whereas the focus of the current paper is to develop robust methods in the presence of heavy-tailed data. ... - Article
- Feb 2016
- J ECONOMETRICS

In this paper, we study robust covariance estimation under the approximate factor model with observed factors. We propose a novel framework to first estimate the initial joint covariance matrix of the observed data and the factors, and then use it to recover the covariance matrix of the observed data. We prove that once the initial matrix estimator is good enough to maintain the element-wise optimal rate, the whole procedure will generate an estimated covariance with the desired properties. For data with only bounded fourth moments, we propose to use Huber loss minimization to obtain the initial joint covariance estimate. This approach is applicable to a much wider range of distributions, including sub-Gaussian and elliptical distributions. We also present an asymptotic result for Huber's M-estimator with a diverging parameter. The conclusions are demonstrated by extensive simulations and real data analysis. - ... Li et al. (2016) developed the Supervised Singular Value Decomposition (SupSVD) method, which exploits linear models to accommodate covariates in dimension reduction of a primary data matrix. Later, Fan et al. (2016) proposed the projected PCA, which generalizes SupSVD by allowing nonparametric relations between covariates and factors. However, these methods are only suitable for a single data set and cannot easily be extended to multi-view data. ...... which corresponds to the projected PCA model proposed by Fan et al. (2016). In particular, if we let the function f(·) be a linear mapping, i.e., f(X) = XB, where B is a q × r coefficient matrix, the above model further connects to the SupSVD model developed in Li et al. (2016). ... - Article
- Mar 2017
- Biometrics

In modern biomedical research, it is ubiquitous to have multiple data sets measured on the same set of samples from different views (i.e., multi-view data). For example, in genetic studies, multiple genomic data sets at different molecular levels or from different cell types are measured for a common set of individuals to investigate genetic regulation. Integration and reduction of multi-view data have the potential to leverage information in different data sets, and to reduce the magnitude and complexity of data for further statistical analysis and interpretation. In this paper, we develop a novel statistical model, called supervised integrated factor analysis (SIFA), for integrative dimension reduction of multi-view data while incorporating auxiliary covariates. The model decomposes data into joint and individual factors, capturing the joint variation across multiple data sets and the individual variation specific to each set respectively. Moreover, both joint and individual factors are partially informed by auxiliary covariates via nonparametric models. We devise a computationally efficient Expectation-Maximization (EM) algorithm to fit the model under some identifiability conditions. We apply the method to the Genotype-Tissue Expression (GTEx) data, and provide new insights into the variation decomposition of gene expression in multiple tissues. Extensive simulation studies and an additional application to a pediatric growth study demonstrate the advantage of the proposed method over competing methods. - ... Therefore, the heterogeneity effect is modeled as a low-rank component Λ_iΛ_i' of the population covariance matrix of X_it. Later, we will show that, under a pervasive assumption, the heterogeneity component can be estimated by directly applying principal component analysis (PCA) or Projected-PCA, which is more accurate when there are sufficiently informative covariates W_i (Fan et al., 2016). Let Λ_iF_i be the estimated heterogeneity component. ...... However, none of these models incorporates the external covariate information. The semiparametric factor model (2.1) was first proposed by Connor and Linton (2007) and further investigated by Connor et al. (2012) and Fan et al. (2016). Using sufficiently informative external covariates, we are able to estimate the factors and loadings more accurately, and hence achieve better heterogeneity adjustment. ...... The above set of assumptions is commonly used in the literature; see Bai and Ng (2013) and Fan et al. (2016). We omit detailed discussions here. ... - Article
- Feb 2016

Heterogeneity is an unwanted variation when analyzing aggregated datasets from multiple sources. Though different methods have been proposed for heterogeneity adjustment, no systematic theory exists to justify these methods. In this work, we propose a generic framework named ALPHA (short for Adaptive Low-rank Principal Heterogeneity Adjustment) to model, estimate, and adjust heterogeneity from the original data. Once the heterogeneity is adjusted, we are able to remove the biases of batch effects and to enhance the inferential power by aggregating the homogeneous residuals from multiple sources. Under a pervasive assumption that the latent heterogeneity factors simultaneously affect a large fraction of observed variables, we provide a rigorous theory to justify the proposed framework. Our framework also allows the incorporation of informative covariates and appeals to the "blessing of dimensionality". As an illustrative application of this generic framework, we consider the problem of estimating a high-dimensional precision matrix for graphical model inference based on multiple datasets. We also provide thorough numerical studies on both synthetic datasets and a brain imaging dataset to demonstrate the efficacy of the developed theory and methods. - ... Traditional multivariate time series models face computational challenges and lose efficiency when the dimension grows. Factor analysis is considered an effective way to alleviate these problems by dimension reduction and to model the dynamics of high-dimensional time series (Geweke, 1977; Chamberlain and Rothschild, 1983; Peña and Box, 1987; Forni and Reichlin, 1998; Forni et al., 2000; Bai and Ng, 2002; Stock and Watson, 2002a,b; Peña and Poncela, 2006; Hallin and Liska, 2007; Fan et al., 2016). Wang et al. (2019) further extended factor models to matrix-valued time series, achieving greater dimension reduction by utilizing the matrix structure of the data and performing both row and column dimension reduction. ...... Nonlinear dynamics have been a popular topic in factor models (Yalcin and Amemiya, 2001; Cunha et al., 2010; Fan et al., 2016, 2017) over the past few decades. Structural breaks (Breitung and Eickmeier, 2011; Chen et al., 2014; Han and Inoue, 2015; Ma and Su, 2018; Bai et al., 2017; Su and Wang, 2017; Barigozzi et al., 2018), thresholding (Massacci, 2017; Liu and Chen, 2019), and Markov chains (Liu and Chen, 2016) were introduced to interpret the nonlinear behaviors observed in vector-valued time series data. ... - Preprint (full-text available)
- Apr 2019

As is known, factor analysis is a popular method to reduce dimension for high-dimensional data. For matrix data, the dimension reduction can be achieved more effectively through both row and column directions. In this paper, we introduce a threshold factor model to analyze matrix-valued high-dimensional time series data. The factor loadings are allowed to switch between regimes, controlled by a threshold variable. Estimation methods for the loading spaces, the threshold value, and the number of factors are proposed, and the asymptotic properties of these estimators are investigated. Not only the strengths of the thresholding and the factors, but also their interactions across directions and regimes, play an important role in the estimation performance. When the thresholding and factors are all strong across regimes, the estimation is immune to the impact of growing dimension, which breaks the curse of dimensionality. When the thresholding in the two directions and the factors across regimes have different levels of strength, we show that the estimators for loadings and the threshold value experience 'helping' effects against the curse of dimensionality. We also discover that even when the numbers of factors are overestimated, the estimators are still consistent. The proposed methods are illustrated with both simulated and real examples. - ... Our framework is analogous to a class of supervised factorization methods for two-way data (matrices), in which covariates inform a PCA/SVD model. Our framework most directly extends the supervised SVD (SupSVD) approach (Li, Yang, Nobel & Shen 2016), which has also been generalized to accommodate sparse or functional PCA models (Li, Yang, Nobel & Shen 2016) and non-parametric relations between the covariates and principal components (Fan et al. 2016). As in matrix factorization techniques such as PCA, the goal of multiway factorization is to capture underlying patterns that explain variation in the data. ...... Moreover, there is a growing body of work on multiway factorization methods for sparse or functional data (Allen 2012, 2013). The SupCP framework may be extended to accommodate sparse and functional data, or non-linear covariate effects, which are analogous to recent extensions of the SupSVD framework (Li, Shen & Huang 2016; Fan et al. 2016). Finally, here we have focused on allowing a random latent variable and covariate supervision for one mode (the sample mode). ... - Article (full-text available)
- Mar 2018

We describe a probabilistic PARAFAC/CANDECOMP (CP) factorization for multiway (i.e., tensor) data that incorporates auxiliary covariates, SupCP. SupCP generalizes the supervised singular value decomposition (SupSVD) for vector-valued observations to allow for observations that have the form of a matrix or higher-order array. Such data are increasingly encountered in biomedical research and other fields. We describe a likelihood-based latent variable representation of the CP factorization, in which the latent variables are informed by additional covariates. We give conditions for identifiability, and develop an EM algorithm for simultaneous estimation of all model parameters. SupCP can be used for dimension reduction, capturing latent structures that are more accurate and interpretable due to covariate supervision. Moreover, SupCP specifies a full probability distribution for a multiway data observation with given covariate values, which can be used for predictive modeling. We conduct comprehensive simulations to evaluate the SupCP algorithm, and we apply it to a facial image database with facial descriptors (e.g., smiling / not smiling) as covariates. Software is available at https://github.com/lockEF/SupCP . - ... In contrast, reduced-rank regression (RRR, Izenman, 1975; Tso, 1981) and envelope models (Cook, Li and Chiaromonte, 2010) provide sufficient dimension reduction (Cook and Ni, 2005) for regression problems. Variants of Principal Component Analysis (PCA, Wold, Kettaneh and Tjessem, 1996; Fan, Liao and Wang, 2014; Di et al., 2009) have been proposed to incorporate auxiliary information. Recently, Li et al. (2015b) proposed SupSVD, a supervised PCA that encompasses regular PCA to RRR. ...

We consider dimension reduction of multivariate data under the existence of various types of auxiliary information. We propose a criterion that provides a series of orthogonal directional vectors that form a basis for dimension reduction. The proposed method can be thought of as an extension of continuum regression, and the resulting basis is called continuum directions. We show that these directions continuously bridge the principal component, mean difference and linear discriminant directions, thus ranging from unsupervised to fully supervised dimension reduction. In the presence of binary supervision data, the proposed directions can be directly used for two-group classification. Numerical studies show that the proposed method works well in high-dimensional settings where the variance of the first principal component is much larger than that of the rest.
- ... Rotation: Fig. 1 shows that Principal Component Analysis is used to find discriminant features in a dataset. The selection method finds correlated data and establishes a correlation matrix, with the eigenvalue cutoff set to 1; since the eigenvalue measures the amount of variance associated with a factor, only features with eigenvalues greater than 1 are retained, while factors with variance less than 1 are dropped, following the standard described in F. Wang's 2009 article "Factor Analysis and Principal Component Analysis" [3]. Varimax rotation is then used to maximize the sum of the variances of the squared correlations between variables and factors. This is achieved when every variable has a high loading on a single factor but near-zero loadings on the remaining factors, and when each factor is defined by only a few variables with very high loadings, while the remaining variables have loadings close to zero on that factor. ... - Article (full-text available)
- Jan 2016

This research uses the K-Nearest Neighbour (K-NN) algorithm to classify internet data traffic. K-NN is suitable for large amounts of data and can produce an accurate classification, but it is computationally expensive because it calculates the distance to all existing data points. One solution to overcome this weakness is to perform a clustering step before the classification step, because clustering does not require high computation time. The clustering algorithm used is Fuzzy C-Means, which does not require the number of clusters to be fixed in advance: the clusters form naturally based on the input dataset. However, Fuzzy C-Means has the disadvantage that the clustering results are often not the same even for the same input data, because its initial dataset is not optimal. To optimize the initial dataset, this research uses a feature selection algorithm; after the main features of the dataset are selected, the output of Fuzzy C-Means becomes consistent. Feature selection is a method that is expected to provide an initial dataset that is optimal for the Fuzzy C-Means algorithm. The feature selection algorithm used in this study is Principal Component Analysis (PCA). PCA removes non-significant attributes to create an optimal dataset and can improve the performance of the clustering and classification algorithms. The result of this study is a combined method of classification, clustering, and feature extraction; these three methods are successfully modeled to generate an internet bandwidth usage classification method that has high accuracy and fast performance. - ... If c_j = 0 as in Fan et al. (2013a), the proposed S-POET method does not shrink the spiked empirical eigenvalues. However, when we have semi-weak factors (Fan et al., 2014), whose corresponding eigenvalues are as weak as of order p/T, shrinkage is necessary to guarantee the convergence of Δ_L1. On the other hand, if instead POET is applied to estimate the covariance matrix, Δ_L1 = O_P(p/(λ_m T) + T^{-1/2}), which is only bounded. ...

We derived the asymptotic distributions of the spiked eigenvalues and eigenvectors under a generalized and unified asymptotic regime, which takes into account the spikiness of the leading eigenvalues, the sample size, and the dimensionality. This new regime allows high dimensionality and diverging eigenvalue spikes, and provides new insights into the roles the leading eigenvalues, sample size, and dimensionality play in principal component analysis. The results are proven by a new technical device, which swaps the roles of rows and columns and converts the high-dimensional problems into low-dimensional ones. Our results are a natural extension of those in Paul (2007) to a more general setting, with new insights, and resolve the rates-of-convergence problems in Shen et al. (2013). They also reveal the biases of estimating the leading eigenvalues and eigenvectors by principal component analysis, and lead to a new covariance estimator for the approximate factor model, called shrinkage principal orthogonal complement thresholding (S-POET), which corrects the biases. Our results are successfully applied to outstanding problems in the estimation of risks of large portfolios and of false discovery proportions for dependent test statistics, and are illustrated by simulation studies.
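The upward bias of spiked sample eigenvalues, and its correction, can be seen in a toy simulation. The sketch below applies a first-order bias correction in the spirit of S-POET; the dimensions, spike values, and the exact correction are illustrative simplifications, not the estimator from the paper:

```python
import numpy as np

rng = np.random.default_rng(5)
p, T, m = 400, 100, 2                            # dimension, sample size, # spikes

# Spiked covariance model: two large population eigenvalues, the rest equal to 1
spikes = np.array([50.0, 20.0])
sd = np.ones(p); sd[:m] = np.sqrt(1.0 + spikes)
Y = rng.standard_normal((T, p)) * sd             # T observations from N(0, Sigma)

evals = np.sort(np.linalg.eigvalsh(Y.T @ Y / T))[::-1]

# In this p >> T regime the leading sample eigenvalues overshoot their
# population counterparts by roughly sigma^2 * p / T; subtract that bias
sigma2_hat = evals[m:].mean()                    # noise-level estimate
lam_shrunk = evals[:m] - sigma2_hat * p / T      # bias-corrected ("shrunk") spikes
```

With p/T = 4, the raw leading eigenvalues come out roughly 4 units too large, and the corrected values land near the population spikes, which is the bias phenomenon the abstract describes.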
- Article (full-text available)
- May 2018

The study investigated the relationship between permissive parenting styles and examination cheating tendencies among secondary school students in Siaya Sub County, Kenya. Diana Baumrind's parenting styles theory and Ajzen's theory of Planned Behaviour provided a theoretical framework for the study, which adopted a correlational design within a mixed methods approach. The target population was 1,908 Form Three students, 35 Teacher Counselors and 35 Deputy Principals. A sample size of 190 Form Three students, which was 10% of the student population, was used after stratified random sampling. In addition, 8 Teacher Counselors and 8 Deputy Principals, purposively sampled, formed part of the participants. Parenting Style and Involvement in Examination Cheating Tendency Questionnaires were used to collect quantitative data from Form Three students, while an interview schedule was used to collect qualitative data from the Teacher Counselors and Deputy Principals. Validity was ascertained by the expert judgment of two university lecturers, while the reliability of the instrument was ensured using Cronbach's reliability test, where an index of 0.77413 was obtained. Quantitative data were analyzed using descriptive statistics as well as inferential statistics such as Pearson correlation, aided by SPSS version 22, while qualitative data were analyzed through a thematic framework. The findings revealed that permissive parenting has a strong positive influence on examination cheating tendencies, with r = 0.641, p < 0.05. The study recommended that the Kenyan Teachers' Service Commission should train more teacher counselors in schools to cope with the large number of students who have varied parental backgrounds. - Article
- Aug 2017
- J ECONOMETRICS

We consider forecasting a single time series when there is a large number of predictors and a possible nonlinear effect. The dimensionality was first reduced via a high-dimensional factor model implemented by the principal component analysis. Using the extracted factors, we develop a novel forecasting method called the sufficient forecasting, which provides a set of sufficient predictive indices, inferred from high-dimensional predictors, to deliver additional predictive power. The projected principal component analysis will be employed to enhance the accuracy of inferred factors when a semi-parametric factor model is assumed. Our method is also applicable to cross-sectional sufficient regression using extracted factors. The connection between the sufficient forecasting and the deep learning architecture is explicitly stated. The sufficient forecasting correctly estimates projection indices of the underlying factors even in the presence of a nonparametric forecasting function. The proposed method extends the sufficient dimension reduction to high-dimensional regimes by condensing the cross-sectional information through factor models. We derive asymptotic properties for the estimate of the central subspace spanned by these projection directions as well as the estimates of the sufficient predictive indices. We further show that the natural method of running multiple regression of target on estimated factors yields a linear estimate that actually falls into this central subspace. Our method and theory allow the number of predictors to be larger than the number of observations. We finally demonstrate that the sufficient forecasting improves upon the linear forecasting in both simulation studies and an empirical study of forecasting macroeconomic variables. - Chapter
- Jan 2016

K-NN is a classification algorithm which is suitable for large amounts of data and has high accuracy for internet traffic classification; unfortunately, the K-NN algorithm has a disadvantage in computation time because it calculates the distance to all data points in the dataset. This research provides an alternative solution to overcome the K-NN computation time: implementing a clustering process before the classification process, since clustering does not require high computation time. The Fuzzy C-Means algorithm is implemented in this research; it clusters the datasets that are entered. Fuzzy C-Means has the disadvantage that the clustering results are often not the same even though the input data are the same, because its initial dataset is not optimal. To optimize the initial datasets, this research uses a feature selection algorithm; after selecting the main features of the dataset, the output of Fuzzy C-Means becomes consistent. Feature selection is a method that is expected to provide an initial dataset that is optimal for the Fuzzy C-Means algorithm. The feature selection algorithm used in this study is Principal Component Analysis (PCA). PCA removes non-significant attributes to create an optimal dataset and can improve the performance of the clustering and classification algorithms. The results of this research show that clustering and principal feature selection have a significant impact on accuracy and computation time for internet traffic classification. The combination of these three methods is successfully modeled to generate a data classification method for internet bandwidth usage. - Article
- Jul 2015
- ANN STAT

We propose a general Principal Orthogonal complEment Thresholding (POET) framework for large-scale covariance matrix estimation based on an approximate factor model. A set of high-level sufficient conditions for the procedure to achieve optimal rates of convergence under different matrix norms is provided to better understand how POET works. Such a framework allows us to recover the results for sub-Gaussian data in a more transparent way that depends only on the concentration properties of the sample covariance matrix. As a new theoretical contribution, for the first time such a framework allows us to exploit a conditional sparsity covariance structure for heavy-tailed data. In particular, for elliptical data we propose a robust estimator based on marginal and multivariate Kendall's tau to satisfy these conditions. In addition, the conditional graphical model is also studied under the same framework. The technical tools developed in this paper are of general interest to high-dimensional principal component analysis. Thorough numerical results are provided to back up the developed theory. - Article
- May 2015

We consider forecasting a single time series when there is a large number of predictors and a possible nonlinear effect. The dimensionality was first reduced via a high-dimensional factor model implemented by the principal component analysis. Using the extracted factors, we develop a link-free forecasting method, called the sufficient forecasting, which provides several sufficient predictive indices, inferred from high-dimensional predictors, to deliver additional predictive power. Our method is also applicable to cross-sectional sufficient regression using extracted factors. The connection between the sufficient forecasting and the deep learning architecture is explicitly stated. The sufficient forecasting correctly estimates projection indices of the underlying factors even in the presence of a nonparametric forecasting function. The proposed method extends the sufficient dimension reduction to high-dimensional regimes by condensing the cross-sectional information through factor models. We derive asymptotic properties for the estimate of the central subspace spanned by these projection directions as well as the estimates of the sufficient predictive indices. We also show that the natural method of running multiple regression of target on estimated factors yields a linear estimate that actually falls into this central subspace. Our method and theory allow the number of predictors to be larger than the number of observations. We finally demonstrate that the sufficient forecasting improves upon the linear forecasting in both simulation studies and an empirical study of forecasting macroeconomic variables. - Article
- Apr 2015
- Econometrics J

Estimating large covariance and precision matrices is fundamental in modern multivariate analysis. The problems arise from statistical analysis of large panel economics and finance data. The covariance matrix reveals marginal correlations between variables, while the precision matrix encodes conditional correlations between pairs of variables given the remaining variables. In this paper, we provide a selective review of several recent developments on estimating large covariance and precision matrices. We focus on two general approaches: rank-based methods and factor-model-based methods. Theories and applications of both approaches are presented. These methods are expected to be widely applicable to the analysis of economic and financial data.
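The factor-model-based approach to covariance estimation can be sketched in a few lines of NumPy. This is a minimal POET-style illustration in which the number of factors is taken as known and the threshold level is an illustrative choice, not the tuned constant from the literature:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, K = 100, 200, 3

# Approximate factor model: Sigma = B B' + sparse error covariance (identity here)
B = rng.standard_normal((p, K))                  # factor loadings
F = rng.standard_normal((n, K))                  # factors
Y = B @ F.T + rng.standard_normal((p, n))        # p x n panel of observations

S = np.cov(Y)                                    # p x p sample covariance

# Principal orthogonal complement: strip out the top-K principal components
evals, evecs = np.linalg.eigh(S)
lam, V = evals[-K:], evecs[:, -K:]               # leading eigenpairs
low_rank = V @ np.diag(lam) @ V.T
R = S - low_rank                                 # residual ("complement") covariance

# Adaptive thresholding of the residual covariance (diagonal kept intact)
tau = 2.0 * np.sqrt(np.log(p) / n)               # illustrative threshold level
scale = np.sqrt(np.outer(np.diag(R), np.diag(R)))
R_thr = np.where(np.abs(R) >= tau * scale, R, 0.0)
np.fill_diagonal(R_thr, np.diag(R))

Sigma_poet = low_rank + R_thr                    # POET-style covariance estimator
```

The low-rank part captures the pervasive common factors, while thresholding the residual enforces the conditional sparsity assumed by the approximate factor model.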

- Article (full-text available)
- May 2012

We prove optimal sparsity oracle inequalities for the estimation of the covariance matrix under the Frobenius norm. In particular, we explore various sparsity structures on the underlying matrix. - Approximation of Functions, 2nd ed.
- Jan 1986

- G Lorentz

Lorentz, G. (1986). Approximation of Functions, 2nd ed. American Mathematical Society, Providence, RI. - Article
- Dec 2002
- J AM STAT ASSOC

This article considers forecasting a single time series when there are many predictors (N) and time series observations (T). When the data follow an approximate factor model, the predictors can be summarized by a small number of indexes, which we estimate using principal components. Feasible forecasts are shown to be asymptotically efficient in the sense that the difference between the feasible forecasts and the infeasible forecasts constructed using the actual values of the factors converges in probability to 0 as both N and T grow large. The estimated factors are shown to be consistent, even in the presence of time variation in the factor model. - Article
- Sep 2011
- J AM STAT ASSOC

In this article a simple two-step estimation procedure of the dynamic factor model is proposed. The estimator allows for heteroscedastic and serially correlated errors. It turns out that the feasible two-step estimator has the same limiting distribution as the generalized least squares (GLS) estimator assuming that the covariance parameters are known. In a Monte Carlo study of the small sample properties, we find that the GLS estimators may be substantially more efficient than the usual estimator based on principal components. Furthermore, it turns out that the iterated version of the estimator may feature considerably improved properties in sample sizes usually encountered in practice. - Multiple hypothesis testing is a fundamental problem in high dimensional inference, with wide applications in many scientific fields. In genome-wide association studies, tens of thousands of tests are performed simultaneously to find if any SNPs are associated with some traits and those tests are correlated. When test statistics are correlated, false discovery control becomes very challenging under arbitrary dependence. In the current paper, we propose a novel method based on principal factor approximation, which successfully subtracts the common dependence and weakens significantly the correlation structure, to deal with an arbitrary dependence structure. We derive an approximate expression for false discovery proportion (FDP) in large scale multiple testing when a common threshold is used and provide a consistent estimate of realized FDP. This result has important applications in controlling FDR and FDP. Our estimate of realized FDP compares favorably with Efron (2007)'s approach, as demonstrated in the simulated examples. Our approach is further illustrated by some real data applications. We also propose a dependence-adjusted procedure, which is more powerful than the fixed threshold procedure.
- Article
- Sep 2013
- J R STAT SOC B

This paper deals with the estimation of a high-dimensional covariance with a conditional sparsity structure and fast-diverging eigenvalues. By assuming a sparse error covariance matrix in an approximate factor model, we allow for the presence of some cross-sectional correlation even after taking out common but unobservable factors. We introduce the Principal Orthogonal complEment Thresholding (POET) method to explore such an approximate factor structure with sparsity. The POET estimator includes the sample covariance matrix, the factor-based covariance matrix (Fan, Fan, and Lv, 2008), the thresholding estimator (Bickel and Levina, 2008) and the adaptive thresholding estimator (Cai and Liu, 2011) as specific examples. We provide mathematical insights when the factor analysis is approximately the same as the principal component analysis for high-dimensional data. The rates of convergence of the sparse residual covariance matrix and the conditional sparse covariance matrix are studied under various norms. It is shown that the impact of estimating the unknown factors vanishes as the dimensionality increases. The uniform rates of convergence for the unobserved factors and their factor loadings are derived. The asymptotic results are also verified by extensive simulation studies. Finally, a real data application on portfolio allocation is presented. - Article · Full-text available
- Mar 2013

The aim of this paper is to establish several deep theoretical properties of principal component analysis for multiple-component spike covariance models. Our new results reveal a surprising asymptotic conical structure in critical sample eigendirections under the spike models with distinguishable (or indistinguishable) eigenvalues, when the sample size and/or the number of variables (or dimension) tend to infinity. The consistency of the sample eigenvectors relative to their population counterparts is determined by the ratio between the dimension and the product of the sample size with the spike size. When this ratio converges to a nonzero constant, the sample eigenvector converges to a cone, with a certain angle to its corresponding population eigenvector. In the High Dimension, Low Sample Size case, the angle between the sample eigenvector and its population counterpart converges to a limiting distribution. Several generalizations of the multi-spike covariance models are also explored, and additional theoretical results are presented. - Article
- Feb 2013
- Oxf Bull Econ Stat

In this article, we propose a selection procedure that allows us to consistently estimate the number of dynamic factors in a dynamic factor model. The procedure is based on a canonical correlation analysis of the static factors which has the advantage of being invariant to a rescaling of the factors. Monte Carlo simulations suggest that the proposed selection rule outperforms existing ones, in particular, if the contribution of the common factors to the overall variance is moderate or low. The new selection procedure is applied to the US macroeconomic data panel used in Stock and Watson [NBER working paper 11467 (2005)]. - Article
- Mar 2012
- J AM STAT ASSOC

A growing number of modern scientific problems in areas such as genomics, neurobiology, and spatial epidemiology involve the measurement and analysis of thousands of related features that may be stochastically dependent at arbitrarily strong levels. In this work, we consider the scenario where the features follow a multivariate Normal distribution. We demonstrate that dependence is manifested as random variation shared among features, and that standard methods may yield highly unstable inference due to dependence, even when the dependence is fully parameterized and utilized in the procedure. We propose a “cross-dimensional inference” framework that alleviates the problems due to dependence by modeling and removing the variation shared among features while also properly regularizing estimation across features. We demonstrate the framework on both simultaneous point estimation and multiple hypothesis testing in scenarios derived from the scientific applications of interest. - Article · Full-text available
- Apr 2012

This paper considers the factor model Xt = ΛFt + et. Assuming a normal distribution for the idiosyncratic error et conditional on the factors {Ft}, conditional maximum likelihood estimators of the factor and factor-loading spaces are derived. These estimators are called generalized principal component estimators (GPCEs) without the normality assumption. This paper derives asymptotic distributions of the GPCEs of the factor and factor-loading spaces. It is shown that variance of the GPCE of the common component is smaller than that of the principal component estimator studied in Bai (2003, Econometrica 71, 135–172). The approximate variance of the forecasting error using the GPCE-based factor estimates is derived and shown to be smaller than that based on the principal component estimator. The feasible GPCE (FGPCE) of the factor space is shown to be asymptotically equivalent to the GPCE. The GPCE and FGPCE are shown to be more efficient than the principal component estimator in finite samples. - Article
- Jan 2009

We modify the criterion by Bai and Ng (2002) for determining the number of factors in approximate factor models. As in the original criterion, for any given number of factors we estimate the common and idiosyncratic components of the model by applying principal component analysis. We select the true number of factors as the number that minimizes the variance explained by the idiosyncratic component. In order to avoid overparametrization, minimization is subject to penalization. At this step, we modify the original procedure by multiplying the penalty function by a positive real number, which allows us to tune its penalizing power, by analogy with the method used by Hallin and Liška (2007) in the frequency domain. The contribution of this paper is twofold. First, our criterion retains the asymptotic properties of the original criterion, but corrects its tendency to overestimate the true number of factors. Second, we provide a computationally easy way to implement the new method by iteratively applying the original criterion. Monte Carlo simulations show that our criterion is in general more robust than the original one. A better performance is achieved in particular in the case of large idiosyncratic disturbances. These conditions are the most difficult for detecting a factor structure but are not unusual in the empirical context. Two applications on a macroeconomic and a financial dataset are also presented. - Article
- Feb 2013
- ANN STAT

A sparse precision matrix can be directly translated into a sparse Gaussian graphical model under the assumption that the data follow a joint normal distribution. This neat property makes high-dimensional precision matrix estimation very appealing in many applications. However, in practice we often face nonnormal data, and variable transformation is often used to achieve normality. In this paper we consider the nonparanormal model that assumes that the variables follow a joint normal distribution after a set of unknown monotone transformations. The nonparanormal model is much more flexible than the normal model while retaining the good interpretability of the latter in that each zero entry in the sparse precision matrix of the nonparanormal model corresponds to a pair of conditionally independent variables. In this paper we show that the nonparanormal graphical model can be efficiently estimated by using a rank-based estimation scheme which does not require estimating these unknown transformation functions. In particular, we study the rank-based graphical lasso, the rank-based neighborhood Dantzig selector and the rank-based CLIME. We establish their theoretical properties in the setting where the dimension is nearly exponentially large relative to the sample size. It is shown that the proposed rank-based estimators work as well as their oracle counterparts defined with the oracle data. Furthermore, the theory motivates us to consider the adaptive version of the rank-based neighborhood Dantzig selector and the rank-based CLIME that are shown to enjoy graphical model selection consistency without assuming the irrepresentable condition for the oracle and rank-based graphical lasso. Simulated and real data are used to demonstrate the finite-sample performance of the rank-based estimators. - Article
- Feb 2013

Estimating and assessing the risk of a large portfolio is an important topic in financial econometrics and risk management. The risk is often estimated by substituting a good estimator of the volatility matrix. However, the accuracy of such a risk estimator for large portfolios is largely unknown, and a simple inequality in the previous literature gives an infeasible upper bound for the estimation error. In addition, numerical studies illustrate that this upper bound is very crude. In this paper, we propose factor-based risk estimators for a large number of assets, and introduce a high-confidence level upper bound (H-CLUB) to assess the accuracy of the risk estimation. The H-CLUB is constructed based on three different estimates of the volatility matrix: sample covariance, approximate factor model with known factors, and unknown factors (POET, Fan, Liao and Mincheva, 2013). For the first time in the literature, we derive the limiting distribution of the estimated risks in high dimensionality. Our numerical results demonstrate that the proposed upper bounds significantly outperform the traditional crude bounds, and provide insightful assessment of the estimation of the portfolio risks. In addition, our simulated results quantify the relative error in the risk estimation, which is usually negligible using 3-month daily data. Finally, the proposed methods are applied to an empirical study. - Article
- Nov 2012
- ANN STAT

Principal component analysis (PCA) is one of the most commonly used statistical procedures with a wide range of applications. This paper considers both minimax and adaptive estimation of the principal subspace in the high dimensional setting. Under mild technical conditions, we first establish the optimal rates of convergence for estimating the principal subspace which are sharp with respect to all the parameters, thus providing a complete characterization of the difficulty of the estimation problem in terms of the convergence rate. The lower bound is obtained by calculating the local metric entropy and an application of Fano's Lemma. The rate optimal estimator is constructed using aggregation, which, however, might not be computationally feasible. We then introduce an adaptive procedure for estimating the principal subspace which is fully data driven and can be computed efficiently. It is shown that the estimator attains the optimal rates of convergence simultaneously over a large collection of the parameter spaces. A key idea in our construction is a reduction scheme which reduces the sparse PCA problem to a high-dimensional multivariate regression problem. This method is potentially also useful for other related problems. - Estimation of large covariance matrices has drawn considerable recent attention, and the theoretical focus so far has mainly been on developing a minimax theory over a fixed parameter space. In this paper, we consider adaptive covariance matrix estimation where the goal is to construct a single procedure which is minimax rate optimal simultaneously over each parameter space in a large collection. A fully data-driven block thresholding estimator is proposed. The estimator is constructed by carefully dividing the sample covariance matrix into blocks and then simultaneously estimating the entries in a block by thresholding. The estimator is shown to be optimally rate adaptive over a wide range of bandable covariance matrices.
A simulation study is carried out and shows that the block thresholding estimator performs well numerically. Some of the technical tools developed in this paper can also be of independent interest.
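The simplest relative of the block thresholding idea is entrywise hard thresholding of the sample covariance. The sketch below is illustrative only (the threshold `lam` is user-chosen here, whereas the estimator above selects blocks and levels in a data-driven way):

```python
import numpy as np

def threshold_cov(X, lam):
    """Entrywise hard-thresholded sample covariance.

    X : (n, p) data matrix. Off-diagonal entries of the sample covariance
    with absolute value below lam are set to zero; the diagonal is kept.
    """
    S = np.cov(X, rowvar=False)
    out = np.where(np.abs(S) >= lam, S, 0.0)
    np.fill_diagonal(out, np.diag(S))
    return out

# Demo: with independent columns, most off-diagonal noise is zeroed out.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))
S = np.cov(X, rowvar=False)
T_hat = threshold_cov(X, lam=0.2)
```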
- This paper proposes two new estimators for determining the number of factors (r) in approximate factor models. We exploit the well-known fact that the r eigenvalues of the variance matrix of N response variables grow unboundedly as N increases while the other eigenvalues remain bounded. The new estimators are obtained simply by maximizing the ratio of two adjacent eigenvalues. Bai and Ng (2002) and Onatski (2006) developed methods by which the number of factors can be estimated by comparing the eigenvalues with prespecified or estimated threshold values. Asymptotically, any scalar multiple of a valid threshold value is also valid. However, the finite-sample properties of the estimators depend on the choice of the thresholds. The estimators we propose do not require the use of threshold values. Our simulation results show that the new estimators have good finite-sample properties unless the signal-to-noise ratios of some factors are too low.
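The eigenvalue-ratio idea lends itself to a very short sketch (illustrative only; the cap `kmax` and the covariance normalization are choices of this example, not of the paper):

```python
import numpy as np

def er_num_factors(X, kmax=8):
    """Eigenvalue-ratio estimate of the number of factors.

    X : (T, N) data panel. Returns the k in 1..kmax that maximizes the
    ratio of the k-th to the (k+1)-th largest eigenvalue of X'X / (NT).
    """
    T, N = X.shape
    eig = np.linalg.eigvalsh(X.T @ X / (N * T))[::-1]   # descending order
    ratios = eig[:kmax] / eig[1:kmax + 1]
    return int(np.argmax(ratios)) + 1

# Demo: a three-factor panel with unit-variance idiosyncratic noise.
rng = np.random.default_rng(0)
F0 = rng.standard_normal((200, 3))
L0 = rng.standard_normal((100, 3))
X = F0 @ L0.T + rng.standard_normal((200, 100))
k_hat = er_num_factors(X)
```

No threshold value is needed: the ratio spikes at the true number of factors because the r-th eigenvalue diverges with N while the (r+1)-th stays bounded.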
- Article
- Oct 2007
- STAT SINICA

This paper deals with a multivariate Gaussian observation model where the eigenvalues of the covariance matrix are all one, except for a finite number which are larger. Of interest is the asymptotic behavior of the eigenvalues of the sample covariance matrix when the sample size and the dimension of the observations both grow to infinity so that their ratio converges to a positive constant. When a population eigenvalue is above a certain threshold and of multiplicity one, the corresponding sample eigenvalue has a Gaussian limiting distribution. There is a "phase transition" of the sample eigenvectors in the same setting. Another contribution here is a study of the second order asymptotics of sample eigenvectors when corresponding eigenvalues are simple and sufficiently large. - Article
- Jul 2011
- J ECONOMETRICS

It is known that the principal component estimates of the factors and the loadings are rotations of the underlying latent factors and loadings. We study conditions under which the latent factors can be estimated asymptotically without rotation. We derive the limiting distributions for the factor estimates when N and T are large and make precise how identification of the factors affects inference based on factor augmented regressions. We also consider factor models with additive individual and time effects. - This paper develops a new estimation procedure for characteristic-based factor models of stock returns. We treat the factor model as a weighted additive nonparametric regression model, with the factor returns serving as time-varying weights, and a set of univariate nonparametric functions relating security characteristics to the associated factor betas. We use a time-series and cross-sectional pooled weighted additive nonparametric regression methodology to simultaneously estimate the factor returns and characteristic-beta functions. By avoiding the curse of dimensionality our methodology allows for a larger number of factors than existing semiparametric methods. We apply the technique to the three-factor Fama-French model, Carhart's four-factor extension of it adding a momentum factor, and a five-factor extension adding an own-volatility factor. We find that momentum and own-volatility factors are at least as important if not more important than size and value in explaining equity return comovements. We test the multifactor beta pricing theory against the Capital Asset Pricing model using a standard test, and against a general alternative using a new nonparametric test.
- The impact of dependence between individual test statistics is currently among the most discussed topics in the multiple testing of high-dimensional data literature, especially since Y. Benjamini and Y. Hochberg [J. R. Stat. Soc., Ser. B 57, No. 1, 289–300 (1995; Zbl 0809.62014)] introduced the false discovery rate (FDR). Many papers have first focused on the impact of dependence on the control of the FDR. Some more recent works have investigated approaches that account for common information shared by all the variables to stabilize the distribution of the error rates. Similarly, we propose to model this sharing of information by a factor analysis structure for the conditional variance of the test statistics. It is shown that the variance of the number of false discoveries increases along with the fraction of common variance. Test statistics for general linear contrasts are deduced, taking advantage of the common factor structure to reduce the variance of the error rates. A conditional FDR estimate is proposed and the overall performance of multiple testing procedure is shown to be markedly improved, regarding the nondiscovery rate, with respect to classical procedures. The present methodology is also assessed by comparison with leading multiple testing methods.
- High-dimensional regression problems which reveal dynamic behavior are typically analyzed by the time propagation of a small number of factors. The inference on the whole system is then based on the low-dimensional time series analysis. Such high-dimensional problems occur frequently in many different fields of science. In this paper we address the problem of inference when the factors and factor loadings are estimated by semiparametric methods. This more flexible modelling approach poses an important question: is it justified, from an inferential point of view, to base statistical inference on the estimated time series factors? We show that the difference between the inference based on the estimated time series and the true unobserved time series is asymptotically negligible. Our results justify fitting vector autoregressive processes to the estimated factors, which allows one to study the dynamics of the whole high-dimensional system with a low-dimensional representation. We illustrate the theory with a simulation study. Also, we apply the method to a study of the dynamic behavior of implied volatilities and discuss other possible applications in finance and economics.
- Article
- Nov 2008
- FOUND COMPUT MATH

We consider a problem of considerable practical interest: the recovery of a data matrix from a sampling of its entries. Suppose that we observe m entries selected uniformly at random from a matrix M. Can we complete the matrix and recover the entries that we have not seen? We show that one can perfectly recover most low-rank matrices from what appears to be an incomplete set of entries. We prove that if the number m of sampled entries obeys m ≥ C n^(1.2)r log n for some positive numerical constant C, then with very high probability, most n×n matrices of rank r can be perfectly recovered by solving a simple convex optimization program. This program finds the matrix with minimum nuclear norm that fits the data. The condition above assumes that the rank is not too large. However, if one replaces the 1.2 exponent with 1.25, then the result holds for all values of the rank. Similar results hold for arbitrary rectangular matrices as well. Our results are connected with the recent literature on compressed sensing, and show that objects other than signals and images can be perfectly reconstructed from very limited information. - This paper deals with the factor modeling for high-dimensional time series based on a dimension-reduction viewpoint. Under stationary settings, the inference is simple in the sense that both the number of factors and the factor loadings are estimated in terms of an eigenanalysis for a nonnegative definite matrix, and is therefore applicable when the dimension of time series is on the order of a few thousands. Asymptotic properties of the proposed method are investigated under two settings: (i) the sample size goes to infinity while the dimension of time series is fixed; and (ii) both the sample size and the dimension of time series go to infinity together. In particular, our estimators for zero-eigenvalues enjoy faster convergence (or slower divergence) rates, hence making the estimation for the number of factors easier. 
In particular, when the sample size and the dimension of time series go to infinity together, the estimators for the eigenvalues are no longer consistent. However, our estimator for the number of the factors, which is based on the ratios of the estimated eigenvalues, still works fine. Furthermore, this estimation shows the so-called "blessing of dimensionality" property in the sense that the performance of the estimation may improve when the dimension of time series increases. A two-step procedure is investigated when the factors are of different degrees of strength. Numerical illustration with both simulated and real data is also reported.
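The eigenanalysis described above can be sketched in a few lines (a hedged illustration: the lag choice `k0` and the plain unnormalized eigen-decomposition are assumptions of this example, not the paper's exact procedure):

```python
import numpy as np

def lagged_cov_eig(X, k0=2):
    """Eigenanalysis of M = sum_{k=1..k0} Sigma(k) Sigma(k)'.

    X : (T, p) time series, assumed mean zero; Sigma(k) is the lag-k
    sample autocovariance. Returns eigenvalues (descending) and the
    corresponding orthonormal eigenvectors of the nonnegative definite M.
    """
    T, p = X.shape
    M = np.zeros((p, p))
    for k in range(1, k0 + 1):
        S_k = X[k:].T @ X[:-k] / (T - k)   # lag-k sample autocovariance
        M += S_k @ S_k.T
    vals, vecs = np.linalg.eigh(M)
    return vals[::-1], vecs[:, ::-1]

# Demo: one AR(1) factor drives a 20-dimensional series.
rng = np.random.default_rng(2)
T, p = 500, 20
b = rng.standard_normal(p)
f = np.zeros(T)
for t in range(1, T):
    f[t] = 0.8 * f[t - 1] + rng.standard_normal()
X = np.outer(f, b) + 0.5 * rng.standard_normal((T, p))
vals, vecs = lagged_cov_eig(X)
```

Serially uncorrelated noise drops out of the lagged autocovariances, so the leading eigenvectors of M line up with the factor loading space.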
- The variance-covariance matrix plays a central role in the inferential theories of high dimensional factor models in finance and economics. Popular regularization methods that directly exploit sparsity are not directly applicable to many financial problems. Classical methods of estimating the covariance matrices are based on the strict factor models, assuming independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming a sparse error covariance matrix, we allow for the presence of cross-sectional correlation even after taking out common factors, which enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu (2011), taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on the covariance matrix estimation based on the factor structure is then studied.
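A schematic version of the factor-plus-thresholding construction (not the authors' implementation: `K` and `lam` are user-chosen here, whereas the paper derives adaptive, entry-dependent thresholds):

```python
import numpy as np

def poet_cov(X, K, lam):
    """Factor-plus-thresholding covariance estimator (schematic).

    X : (n, p) observations. The top-K principal components of the sample
    covariance form the low-rank part; the principal orthogonal complement
    is hard-thresholded off the diagonal at level lam and added back.
    """
    S = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(vals)[::-1][:K]
    low_rank = (vecs[:, top] * vals[top]) @ vecs[:, top].T
    R = S - low_rank                            # principal orthogonal complement
    R_thr = np.where(np.abs(R) >= lam, R, 0.0)  # keep only large residual covariances
    np.fill_diagonal(R_thr, np.diag(R))
    return low_rank + R_thr

# Demo on correlated data; the diagonal of S is preserved by construction.
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 10))
S = np.cov(X, rowvar=False)
Sigma_hat = poet_cov(X, K=2, lam=0.1)
```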
- Article
- May 2012
- ANN STAT

This paper considers the maximum likelihood estimation of factor models of high dimension, where the number of variables (N) is comparable with or even greater than the number of observations (T). An inferential theory is developed. We establish not only consistency but also the rate of convergence and the limiting distributions. Five different sets of identification conditions are considered. We show that the distributions of the MLE estimators depend on the identification restrictions. Unlike the principal components approach, the maximum likelihood estimator explicitly allows heteroskedasticities, which are jointly estimated with other parameters. Efficiency of MLE relative to the principal components method is also considered. - Article
- Apr 2015
- J ECONOMETRICS

Factor model methods recently have become extremely popular in the theory and practice of large panels of time series data. Those methods rely on various factor models which all are particular cases of the Generalized Dynamic Factor Model (GDFM) introduced in Forni et al. (2000). That paper, however, rests on Brillinger’s dynamic principal components. The corresponding estimators are two-sided filters whose performance at the end of the observation period or for forecasting purposes is rather poor. No such problem arises with estimators based on standard principal components, which have been dominant in this literature. On the other hand, those estimators require the assumption that the space spanned by the factors has finite dimension. In the present paper, we argue that such an assumption is extremely restrictive and potentially quite harmful. Elaborating upon recent results by Anderson and Deistler (2008a, b) on singular stationary processes with rational spectrum, we obtain one-sided representations for the GDFM without assuming finite dimension of the factor space. Construction of the corresponding estimators is also briefly outlined. In a companion paper, we establish consistency and rates for such estimators, and provide Monte Carlo results further motivating our approach. - Article
- Sep 2006
- J Empir Finance

We introduce an alternative version of the Fama–French three-factor model of stock returns together with a new estimation methodology. We assume that the factor betas in the model are smooth nonlinear functions of observed security characteristics. We develop an estimation procedure that combines nonparametric kernel methods for constructing mimicking portfolios with parametric nonlinear regression to estimate factor returns and factor betas simultaneously. The methodology is applied to US common stocks and the empirical findings compared to those of Fama and French. - Article
- Dec 2007
- Handbook Econometrics

Often researchers find parametric models restrictive and sensitive to deviations from the parametric specifications; semi-nonparametric models are more flexible and robust, but lead to other complications such as introducing infinite-dimensional parameter spaces that may not be compact and the optimization problem may no longer be well-posed. The method of sieves provides one way to tackle such difficulties by optimizing an empirical criterion over a sequence of approximating parameter spaces (i.e., sieves); the sieves are less complex but are dense in the original space and the resulting optimization problem becomes well-posed. With different choices of criteria and sieves, the method of sieves is very flexible in estimating complicated semi-nonparametric models with (or without) endogeneity and latent heterogeneity. It can easily incorporate prior information and constraints, often derived from economic theory, such as monotonicity, convexity, additivity, multiplicity, exclusion and nonnegativity. It can simultaneously estimate the parametric and nonparametric parts in semi-nonparametric models, typically with optimal convergence rates for both parts. This chapter describes estimation of semi-nonparametric econometric models via the method of sieves. We present some general results on the large sample properties of the sieve estimates, including consistency of the sieve extremum estimates, convergence rates of the sieve M-estimates, pointwise normality of series estimates of regression functions, root-n asymptotic normality and efficiency of sieve estimates of smooth functionals of infinite-dimensional parameters. Examples are used to illustrate the general results. - In this paper, we propose a semiparametric approach, named nonparanormal skeptic, for efficiently and robustly estimating high dimensional undirected graphical models. To achieve modeling flexibility, we consider Gaussian Copula graphical models (or the nonparanormal) as proposed by Liu et al.
(2009). To achieve estimation robustness, we exploit nonparametric rank-based correlation coefficient estimators, including Spearman's rho and Kendall's tau. In high dimensional settings, we prove that the nonparanormal skeptic achieves the optimal parametric rate of convergence in both graph and parameter estimation. This result suggests that the Gaussian copula graphical models can be used as a safe replacement for the popular Gaussian graphical models, even when the data are truly Gaussian. Besides theoretical analysis, we also conduct thorough numerical simulations to compare different estimators for their graph recovery performance under both ideal and noisy settings. The proposed methods are then applied on a large-scale genomic dataset to illustrate their empirical usefulness. The R language software package huge implementing the proposed methods is available on the Comprehensive R Archive Network: http://cran.r-project.org/.
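The rank-based step is easy to illustrate: under a Gaussian copula, sin(π/2 · τ) maps Kendall's tau back to the latent Pearson correlation, with no need to estimate the monotone transformations. A small sketch (illustrative, not the huge package; it uses scipy's kendalltau and a plain O(p²) loop):

```python
import numpy as np
from scipy.stats import kendalltau

def tau_correlation(X):
    """Rank-based estimate of the latent correlation matrix.

    Under a Gaussian copula (nonparanormal) model, sin(pi/2 * tau)
    consistently estimates the latent Pearson correlation without
    estimating the unknown monotone transformations.
    """
    n, p = X.shape
    R = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau, _ = kendalltau(X[:, i], X[:, j])
            R[i, j] = R[j, i] = np.sin(np.pi / 2 * tau)
    return R

# Demo: latent correlation 0.7, observed through monotone transformations.
rng = np.random.default_rng(4)
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=400)
X = np.column_stack([np.exp(z[:, 0]), z[:, 1] ** 3])
R = tau_correlation(X)
```

Because Kendall's tau depends only on ranks, the estimate is exactly invariant to the strictly increasing transformations applied to each coordinate.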
- Article
- Dec 2011
- ANN STAT

Principal component analysis (PCA) is a classical dimension reduction method which projects data onto the principal subspace spanned by the leading eigenvectors of the covariance matrix. However, it behaves poorly when the number of features p is comparable to, or even much larger than, the sample size n. In this paper, we propose a new iterative thresholding approach for estimating principal subspaces in the setting where the leading eigenvectors are sparse. Under a spiked covariance model, we find that the new approach recovers the principal subspace and leading eigenvectors consistently, and even optimally, in a range of high-dimensional sparse settings. Simulated examples also demonstrate its competitive performance. - Article
- Apr 2011
- J MULTIVARIATE ANAL

Sparse Principal Component Analysis (PCA) methods are efficient tools to reduce the dimension (or the number of variables) of complex data. Sparse principal components (PCs) are easier to interpret than conventional PCs, because most loadings are zero. We study the asymptotic properties of these sparse PC directions for scenarios with fixed sample size and increasing dimension (i.e. High Dimension, Low Sample Size (HDLSS)). Under the previously studied spike covariance assumption, we show that Sparse PCA remains consistent under the same large spike condition that was previously established for conventional PCA. Under a broad range of small spike conditions, we find a large set of sparsity assumptions where Sparse PCA is consistent, but PCA is strongly inconsistent. The boundaries of the consistent region are clarified using an oracle result. - This paper studies the sparsistency and rates of convergence for estimating sparse covariance and precision matrices based on penalized likelihood with nonconvex penalty functions. Here, sparsistency refers to the property that all parameters that are zero are actually estimated as zero with probability tending to one. Depending on the case of applications, sparsity may occur a priori on the covariance matrix, its inverse or its Cholesky decomposition. We study these three sparsity exploration problems under a unified framework with a general penalty function. We show that the rates of convergence for these problems under the Frobenius norm are of order (s_n log p_n / n)^{1/2}, where s_n is the number of nonzero elements, p_n is the size of the covariance matrix and n is the sample size. This explicitly spells out that the contribution of high dimensionality is merely a logarithmic factor. The conditions on the rate with which the tuning parameter λ_n goes to 0 have been made explicit and compared under different penalties.
As a result, for the L_1 penalty, to guarantee sparsistency and the optimal rate of convergence, the number of nonzero elements should be small: s_n' = O(p_n) at most, among O(p_n^2) parameters, for estimating a sparse covariance or correlation matrix, a sparse precision or inverse correlation matrix, or a sparse Cholesky factor, where s_n' is the number of nonzero off-diagonal entries. On the other hand, with the SCAD or hard-thresholding penalty functions, there is no such restriction.
- Article
- Dec 2010
- STAT PROBABIL LETT

- The procedure proposed by Bai and Ng (2002) for identifying the number of factors in static factor models is revisited. In order to improve its performance, we introduce a tuning multiplicative constant in the penalty, an idea that was proposed by Hallin and Liška (2007) in the context of dynamic factor models. Simulations show that our method in general delivers more reliable estimates, in particular in the case of large idiosyncratic disturbances. - This paper deals with the trace regression model where $n$ entries or linear combinations of entries of an unknown $m_1\times m_2$ matrix $A_0$ corrupted by noise are observed. We propose a new nuclear norm penalized estimator of $A_0$ and establish a general sharp oracle inequality for this estimator for arbitrary values of $n,m_1,m_2$ under the condition of isometry in expectation. Then this method is applied to the matrix completion problem. In this case, the estimator admits a simple explicit form and we prove that it satisfies oracle inequalities with faster rates of convergence than in the previous works. They are valid, in particular, in the high-dimensional setting $m_1m_2\gg n$. We show that the obtained rates are optimal up to logarithmic factors in a minimax sense and also derive, for any fixed matrix $A_0$, a non-minimax lower bound on the rate of convergence of our estimator, which coincides with the upper bound up to a constant factor. Finally, we show that our procedure provides an exact recovery of the rank of $A_0$ with probability close to 1. We also discuss the statistical learning setting where there is no underlying model determined by $A_0$ and the aim is to find the best trace regression model approximating the data.
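Nuclear-norm-penalized estimators of this kind are built on singular value soft-thresholding, the proximal operator of the nuclear norm. A minimal sketch (generic building block, not the paper's estimator):

```python
import numpy as np

def svt(Y, lam):
    """Singular value soft-thresholding: prox of the nuclear norm.

    Shrinks each singular value of Y toward zero by lam, zeroing those
    below lam; the result solves min_Z 0.5*||Z - Y||_F^2 + lam*||Z||_*.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_thr = np.maximum(s - lam, 0.0)
    return (U * s_thr) @ Vt

# Demo: singular values 3, 1, 0.2 become 2.5, 0.5, 0 at level 0.5,
# so the output has lower rank than the input.
A = np.diag([3.0, 1.0, 0.2])
Z = svt(A, 0.5)
```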
- Article
- Sep 2010
- J AM STAT ASSOC

We consider large-scale studies in which there are hundreds or thousands of correlated cases to investigate, each represented by its own normal variate, typically a z-value. A familiar example is provided by a microarray experiment comparing healthy with sick subjects' expression levels for thousands of genes. This paper concerns the accuracy of summary statistics for the collection of normal variates, such as their empirical cdf or a false discovery rate statistic. It seems like we must estimate an N by N correlation matrix, N the number of cases, but our main result shows that this is not necessary: good accuracy approximations can be based on the root mean square correlation over all N · (N - 1)/2 pairs, a quantity often easily estimated. A second result shows that z-values closely follow normal distributions even under non-null conditions, supporting application of the main theorem. Practical application of the theory is illustrated for a large leukemia microarray study.

- Covariance matrix plays a central role in multivariate statistical analysis. Significant advances have been made recently on developing both theory and methodology for estimating large covariance matrices. However, a minimax theory has yet to be developed. In this paper we establish the optimal rates of convergence for estimating the covariance matrix under both the operator norm and Frobenius norm. It is shown that optimal procedures under the two norms are different and consequently matrix estimation under the operator norm is fundamentally different from vector estimation. The minimax upper bound is obtained by constructing a special class of tapering estimators and by studying their risk properties. A key step in obtaining the optimal rate of convergence is the derivation of the minimax lower bound. The technical analysis requires new ideas that are quite different from those used in the more conventional function/sequence estimation problems.
Comment: Published in the Annals of Statistics (http://dx.doi.org/10.1214/09-AOS752) by the Institute of Mathematical Statistics (http://www.imstat.org)
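A tapering estimator of the kind used for the minimax upper bound down-weights entry $(i,j)$ of the sample covariance according to the distance $|i-j|$. A sketch with a linear weight profile and an illustrative bandwidth $k$ (not the paper's exact construction):

```python
import numpy as np

def taper_weights(p, k):
    # Linear taper: weight 1 for |i - j| <= k/2, decaying linearly
    # to 0 at |i - j| >= k.
    i, j = np.indices((p, p))
    d = np.abs(i - j)
    return np.clip(2.0 - 2.0 * d / k, 0.0, 1.0)

def tapered_covariance(X, k):
    # Entrywise product of the sample covariance with the taper weights.
    S = np.cov(X, rowvar=False)
    return taper_weights(S.shape[0], k) * S

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 8))
S_tap = tapered_covariance(X, k=4)  # entries with |i - j| >= 4 are zeroed
```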
- Article
- Dec 2009
- ANN STAT

High-dimensional inference refers to problems of statistical estimation in which the ambient dimension of the data may be comparable to or possibly even larger than the sample size. We study an instance of high-dimensional inference in which the goal is to estimate a matrix $\Theta^* \in \mathbb{R}^{k \times p}$ on the basis of $N$ noisy observations, and the unknown matrix $\Theta^*$ is assumed to be either exactly low rank, or ``near'' low-rank, meaning that it can be well-approximated by a matrix with low rank. We consider an $M$-estimator based on regularization by the trace or nuclear norm over matrices, and analyze its performance under high-dimensional scaling. We provide non-asymptotic bounds on the Frobenius norm error that hold for a general class of noisy observation models, and then illustrate their consequences for a number of specific matrix models, including low-rank multivariate or multi-task regression, system identification in vector autoregressive processes, and recovery of low-rank matrices from random projections. Simulation results show excellent agreement with the high-dimensional scaling of the error predicted by our theory. Comment: Appeared as Stat. technical report, UC Berkeley

- Principal Component Analysis (PCA) is an important tool of dimension reduction especially when the dimension (or the number of variables) is very high. Asymptotic studies where the sample size is fixed, and the dimension grows [i.e., High Dimension, Low Sample Size (HDLSS)] are becoming increasingly relevant. We investigate the asymptotic behavior of the Principal Component (PC) directions. HDLSS asymptotics are used to study consistency, strong inconsistency and subspace consistency.
We show that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency) and most other PC directions are strongly inconsistent. Broad sets of sufficient conditions for each of these cases are specified and the main theorem gives a catalogue of possible combinations. In preparation for these results, we show that the geometric representation of HDLSS data holds under general conditions, which includes a $\rho$-mixing condition and a broad range of sphericity measures of the covariance matrix. Comment: Published in the Annals of Statistics (http://dx.doi.org/10.1214/09-AOS709) by the Institute of Mathematical Statistics (http://www.imstat.org)
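The consistency/inconsistency dichotomy described above can be reproduced in a small spiked-covariance simulation: with the sample size fixed, a first eigenvalue that is large relative to the dimension makes the leading sample PC direction line up with the population one. A sketch with illustrative dimensions and spike sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20  # fixed sample size (HDLSS regime)

def leading_pc_alignment(p, spike):
    # Population covariance: eigenvalue `spike` along e_1, all others 1.
    X = rng.standard_normal((n, p))
    X[:, 0] *= np.sqrt(spike)                  # inflate variance along e_1
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return np.abs(Vt[0, 0])                    # |cos(angle to e_1)|; 1 = perfect

weak = leading_pc_alignment(p=500, spike=10.0)     # spike small vs. p
strong = leading_pc_alignment(p=500, spike=5000.0)  # spike large vs. p
```

With the large spike the alignment is close to 1; with the small one the leading direction is badly contaminated by the accumulated noise, matching the inconsistency phenomenon.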
- Article
- Apr 2001
- ANN STAT

Let $x_{(1)}$ denote the square of the largest singular value of an $n \times p$ matrix $X$, all of whose entries are independent standard Gaussian variates. Equivalently, $x_{(1)}$ is the largest principal component variance of the covariance matrix $X'X$, or the largest eigenvalue of a $p$-variate Wishart distribution on $n$ degrees of freedom with identity covariance.

Consider the limit of large $p$ and $n$ with $n/p = \gamma \ge 1$. When centered by $\mu_p = (\sqrt{n-1} + \sqrt{p})^2$ and scaled by $\sigma_p = (\sqrt{n-1} + \sqrt{p})\left(1/\sqrt{n-1} + 1/\sqrt{p}\right)^{1/3}$, the distribution of $x_{(1)}$ approaches the Tracy-Widom law of order 1, which is defined in terms of the Painlevé II differential equation and can be numerically evaluated and tabulated in software. Simulations show the approximation to be informative for $n$ and $p$ as small as 5.

The limit is derived via a corresponding result for complex Wishart matrices using methods from random matrix theory. The result suggests that some aspects of large-$p$ multivariate distribution theory may be easier to apply in practice than their fixed-$p$ counterparts.

- This paper proposes a factor model with infinite dynamics and nonorthogonal idiosyncratic components. The model, which we call the generalized dynamic-factor model, is novel to the literature and generalizes the static approximate factor model of Chamberlain and Rothschild (1983), as well as the exact factor model à la Sargent and Sims (1977). We provide identification conditions, propose an estimator of the common components, prove convergence as both time and cross-sectional size go to infinity at appropriate rates, and present simulation results. We use our model to construct a coincident index for the European Union. Such an index is defined as the common component of real GDP within a model including several macroeconomic variables for each European country.
© 2000 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology
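The Tracy-Widom centering and scaling above are simple to compute directly; a sketch that standardizes the largest eigenvalue of $X'X$ accordingly (evaluating the Tracy-Widom law itself is left to dedicated libraries):

```python
import numpy as np

def standardized_largest_eig(X):
    # Center and scale the largest eigenvalue of X'X as in the
    # white-Wishart Tracy-Widom approximation.
    n, p = X.shape
    mu = (np.sqrt(n - 1) + np.sqrt(p)) ** 2
    sigma = (np.sqrt(n - 1) + np.sqrt(p)) * (
        1 / np.sqrt(n - 1) + 1 / np.sqrt(p)) ** (1 / 3)
    x1 = np.linalg.eigvalsh(X.T @ X).max()
    return (x1 - mu) / sigma

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 100))
s = standardized_largest_eig(X)  # approximately Tracy-Widom of order 1
```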
- Article
- Feb 2009
- ANN STAT

This paper considers regularizing a covariance matrix of $p$ variables estimated from $n$ observations, by hard thresholding. We show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and $(\log p)/n\to0$, and obtain explicit rates. The results are uniform over families of covariance matrices which satisfy a fairly natural notion of sparsity. We discuss an intuitive resampling scheme for threshold selection and prove a general cross-validation result that justifies this approach. We also compare thresholding to other covariance estimators in simulations and on an example from climate data.

- We develop a general framework for performing large-scale significance testing in the presence of arbitrarily strong dependence. We derive a low-dimensional set of random vectors, called a dependence kernel, that fully captures the dependence structure in an observed high-dimensional dataset. This result shows a surprising reversal of the "curse of dimensionality" in the high-dimensional hypothesis testing setting. We show theoretically that conditioning on a dependence kernel is sufficient to render statistical tests independent regardless of the level of dependence in the observed data. This framework for multiple testing dependence has implications in a variety of common multiple testing problems, such as in gene expression studies, brain imaging, and spatial epidemiology. Keywords: empirical null; false discovery rate; latent structure; simultaneous inference; surrogate variable analysis
- High-dimension, low-sample-size datasets have different geometrical properties from those of traditional low-dimensional data. In their asymptotic study regarding increasing dimensionality with a fixed sample size, Hall et al. (2005) showed that each data vector is approximately located on the vertices of a regular simplex in a high-dimensional space. A perhaps unappealing aspect of their result is the underlying assumption which requires the variables, viewed as a time series, to be almost independent. We establish an equivalent geometric representation under much milder conditions using asymptotic properties of sample covariance matrices. We discuss implications of the results, such as the use of principal component analysis in a high-dimensional space, extension to the case of nonindependent samples and also the binary classification problem.
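The simplex geometry is easy to verify numerically: for independent standard-normal entries, the squared pairwise distances between data vectors concentrate around $2p$, so the points look like vertices of a regular simplex. An illustrative check:

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 20000, 5                      # high dimension, tiny sample
X = rng.standard_normal((n, p))

# All pairwise squared distances, scaled by the dimension p.
# Each should be close to 2, the expectation of (X_i - X_j)^2 per coordinate.
d2 = np.array([np.sum((X[i] - X[j]) ** 2) / p
               for i in range(n) for j in range(i + 1, n)])
```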
- Article (full-text available)
- Apr 2000

This paper develops a new estimation procedure for characteristic-based factor models of stock returns. It describes a factor model in which the factor betas are smooth nonlinear functions of observed security characteristics. It develops an estimation procedure that combines nonparametric kernel methods for constructing mimicking portfolios with parametric nonlinear regression to estimate factor returns and factor betas. Factor models are estimated for UK and US common stocks using book-to-price ratio, market capitalizations, and dividend yield.

- This article develops an information criterion for determining the number q of common shocks in the general dynamic factor model developed by Forni et al., as opposed to the restricted dynamic model considered by Bai and Ng and by Amengual and Watson. Our criterion is based on the fact that this number q is also the number of diverging eigenvalues of the spectral density matrix of the observations as the number n of series goes to infinity. We provide sufficient conditions for consistency of the criterion for large n and T (where T is the series length). We show how the method can be implemented and provide simulations and empirics illustrating its very good finite-sample performance. Application to real data adds a new empirical facet to an ongoing debate on the number of factors driving the U.S. economy.
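Characteristic-based betas of the kind described above — loadings that are smooth functions of an observed characteristic — can be approximated with a sieve: regress the cross-section of returns on basis functions of the characteristic and read off the fitted common component. A minimal polynomial-sieve sketch (the basis, data, and true beta curve are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
p = 300                               # number of stocks
z = rng.uniform(-1, 1, p)             # observed characteristic (e.g. log size)
beta = 1.0 + 0.5 * z - z ** 2         # assumed true smooth loading curve
f = 0.8                               # factor return in this period
r = beta * f + 0.1 * rng.standard_normal(p)   # cross-section of returns

# Polynomial sieve basis in the characteristic.
Phi = np.column_stack([np.ones(p), z, z ** 2, z ** 3])
coef, *_ = np.linalg.lstsq(Phi, r, rcond=None)
beta_f_hat = Phi @ coef               # fitted common component beta(z_i) * f
```

With one period the scale of beta and f is not separately identified, so the sketch recovers their product; with many periods the loadings and factor returns can be disentangled, as in the procedures above.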
- Article
- Feb 1987
- Econometrica

This paper describes a simple method of calculating a heteroskedasticity and autocorrelation consistent covariance matrix that is positive semi-definite by construction. It also establishes consistency of the estimated covariance matrix under fairly general conditions.

- Article
- Feb 2003
- Econometrica

This paper develops an inferential theory for factor models of large dimensions. The principal components estimator is considered because it is easy to compute and is asymptotically equivalent to the maximum likelihood estimator (if normality is assumed). We derive the rate of convergence and the limiting distributions of the estimated factors, factor loadings, and common components. The theory is developed within the framework of large cross sections ("N") and a large time dimension ("T"), to which classical factor analysis does not apply. We show that the estimated common components are asymptotically normal with a convergence rate equal to the minimum of the square roots of "N" and "T". The estimated factors and their loadings are generally normal, although not always so. The convergence rate of the estimated factors and factor loadings can be faster than that of the estimated common components. These results are obtained under general conditions that allow for correlations and heteroskedasticities in both dimensions. Stronger results are obtained when the idiosyncratic errors are serially uncorrelated and homoskedastic. A necessary and sufficient condition for consistency is derived for large "N" but fixed "T". Copyright The Econometric Society 2003.

- Article
- Feb 1991
- Econometrica

This paper is concerned with the estimation of covariance matrices in the presence of heteroskedasticity and autocorrelation of unknown forms. Currently available estimators that are designed for this context depend upon the choice of a lag truncation parameter and a weighting scheme. No results are available regarding the choice of lag truncation parameter for a fixed sample size, regarding data-dependent automatic lag truncation parameters, or regarding the choice of weighting scheme. This paper addresses these problems. Asymptotically optimal kernel/weighting scheme and bandwidth/lag truncation parameters are obtained. Using these results, data-dependent automatic bandwidth/lag truncation parameters are introduced. Copyright 1991 by The Econometric Society.

- Article
- Feb 2002
- J Am Stat Assoc

This article considers forecasting a single time series when there are many predictors (N) and time series observations (T). When the data follow an approximate factor model, the predictors can be summarized by a small number of indexes, which we estimate using principal components. Feasible forecasts are shown to be asymptotically efficient in the sense that the difference between the feasible forecasts and the infeasible forecasts constructed using the actual values of the factors converges in probability to 0 as both N and T grow large. The estimated factors are shown to be consistent, even in the presence of time variation in the factor model.

- Article
- Jan 2001
- Econometrica

In this paper we develop some econometric theory for factor models of large dimensions. The focus is the determination of the number of factors (r), which is an unresolved issue in the rapidly growing literature on multifactor models. We first establish the convergence rate for the factor estimates that will allow for consistent estimation of r. We then propose some panel criteria and show that the number of factors can be consistently estimated using the criteria. The theory is developed under the framework of large cross-sections (N) and large time dimensions (T). No restriction is imposed on the relation between N and T. Simulations show that the proposed criteria have good finite sample properties in many configurations of the panel data encountered in practice. JEL Classification: C13, C33, C43. Keywords: Factor analysis, asset pricing, principal components, model selection.
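Panel criteria of this type can be implemented in a few lines: for each candidate k, compute the residual variance V(k) left after removing the top k principal components and add a penalty increasing in k. A sketch in the spirit of the $IC_{p2}$-type rule, on illustrative simulated data (the exact criterion and notation are the paper's; the data-generating choices here are assumptions):

```python
import numpy as np

def ic_number_of_factors(X, kmax):
    # Bai-Ng style information criterion: log V(k) plus a penalty
    # proportional to k; the minimizer over k estimates r.
    T, N = X.shape
    s = np.linalg.svd(X, compute_uv=False)
    penalty = (N + T) / (N * T) * np.log(min(N, T))
    ics = [np.log(np.sum(s[k:] ** 2) / (N * T)) + k * penalty
           for k in range(kmax + 1)]
    return int(np.argmin(ics))

rng = np.random.default_rng(7)
T, N, r = 100, 80, 3
F = rng.standard_normal((T, r))          # latent factors
L = rng.standard_normal((N, r))          # loadings
X = F @ L.T + rng.standard_normal((T, N))  # panel with 3 strong factors
r_hat = ic_number_of_factors(X, kmax=8)
```

The key property is that the penalty vanishes slowly enough that true factors always pay for themselves in reduced residual variance, while spurious ones do not.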