Embracing the Blessing of Dimensionality in Factor Models

Article in Journal of the American Statistical Association 113(521) · October 2016
Abstract
Factor modeling is an essential tool for exploring intrinsic dependence structures among high-dimensional random variables. Much progress has been made for estimating the covariance matrix from a high-dimensional factor model. However, the blessing of dimensionality has not yet been fully embraced in the literature: much of the available data is often ignored in constructing covariance matrix estimates. If our goal is to accurately estimate a covariance matrix of a set of targeted variables, shall we employ additional data, which are beyond the variables of interest, in the estimation? In this paper, we provide sufficient conditions for an affirmative answer, and further quantify the gain in terms of Fisher information and convergence rate. In fact, even an oracle-like result (as if all the factors were known) can be achieved when a sufficiently large number of variables is used. The idea of utilizing as much data as possible brings computational challenges. A divide-and-conquer algorithm is thus proposed to alleviate the computational burden, and is also shown not to sacrifice any statistical accuracy in comparison with a pooled analysis. Simulation studies further confirm our advocacy for the use of full data, and demonstrate the effectiveness of the above algorithm. Our proposal is applied to a microarray data example that shows empirical benefits of using more data.
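A minimal sketch of the core idea, under a simple approximate factor model with a diagonal idiosyncratic covariance: fit the factor structure by PCA to the full panel of available variables, then read off the covariance block of the targeted variables. The function and the toy data below are illustrative assumptions, not the authors' estimator or their divide-and-conquer implementation.

```python
import numpy as np

def factor_cov_targeted(X_all, target_idx, n_factors):
    """Estimate the covariance of a targeted subset of variables by fitting
    a factor model (via PCA) to ALL available variables: a sketch of the
    'use more data' idea, not the paper's exact procedure."""
    T, p = X_all.shape
    Xc = X_all - X_all.mean(axis=0)
    S = Xc.T @ Xc / T                                   # sample covariance of the full panel
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(vals)[::-1][:n_factors]
    loadings = vecs[:, top] * np.sqrt(vals[top])        # p x K estimated loadings
    factors = Xc @ vecs[:, top] / np.sqrt(vals[top])    # T x K unit-variance factor scores
    resid = Xc - factors @ loadings.T
    Sigma_hat = loadings @ loadings.T + np.diag(resid.var(axis=0))
    return Sigma_hat[np.ix_(target_idx, target_idx)]    # block for the targeted variables

# toy usage: 50 targeted variables sitting inside a 500-variable panel
rng = np.random.default_rng(0)
F = rng.standard_normal((200, 3))
B = rng.standard_normal((500, 3))
X = F @ B.T + rng.standard_normal((200, 500))
Sigma_target = factor_cov_targeted(X, np.arange(50), n_factors=3)
```

The point of the sketch is only that the loadings and factors are estimated from the full panel, so the targeted block can borrow strength from variables outside the target set.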

  • Article
    Principal component analysis continues to be a powerful tool in dimension reduction of high dimensional data. We assume a variance-diverging model and use the high-dimension, low-sample-size asymptotics to show that even though the principal component directions are not consistent, the sample and prediction principal component scores can be useful in revealing the population structure. We further show that these scores are biased, and the bias is asymptotically decomposed into rotation and scaling parts. We propose methods of bias-adjustment that are shown to be consistent and work well in the finite but high dimensional situations with small sample sizes. The potential advantage of bias-adjustment is demonstrated in a classification setting.
  • Preprint
    Factor models are a class of powerful statistical models that have been widely used to deal with dependent measurements that arise frequently in applications ranging from genomics and neuroscience to economics and finance. As data are collected at an ever-growing scale, statistical machine learning faces some new challenges: high dimensionality, strong dependence among observed variables, heavy-tailed variables and heterogeneity. High-dimensional robust factor analysis serves as a powerful toolkit to conquer these challenges. This paper gives a selective overview of recent advances in high-dimensional factor models and their applications to statistics, including Factor-Adjusted Robust Model selection (FarmSelect) and Factor-Adjusted Robust Multiple testing (FarmTest). We show that classical methods, especially principal component analysis (PCA), can be tailored to many new problems and provide powerful tools for statistical estimation and inference. We highlight PCA and its connections to matrix perturbation theory, robust statistics, random projection, false discovery rate, etc., and illustrate through several applications how insights from these fields yield solutions to modern challenges. We also present far-reaching connections between factor models and popular statistical learning problems, including network analysis and low-rank matrix recovery.
  • Article
    Full-text available
    Multivariate association analysis is of primary interest in many applications. Despite the prevalence of high-dimensional and non-Gaussian data (such as count-valued or binary), most existing methods only apply to low-dimensional data with continuous measurements. Motivated by the Computer Audition Lab 500-song (CAL500) music annotation study, we develop a new framework for the association analysis of two sets of high-dimensional and heterogeneous (continuous/binary/count) data. We model heterogeneous random variables using exponential family distributions, and exploit a structured decomposition of the underlying natural parameter matrices to identify shared and individual patterns for two data sets. We also introduce a new measure of the strength of association, and a permutation-based procedure to test its significance. An alternating iteratively reweighted least squares algorithm is devised for model fitting, and several variants are developed to expedite computation and achieve variable selection. The application to the CAL500 data sheds light on the relationship between acoustic features and semantic annotations, and provides effective means for automatic music annotation and retrieval.
  • Preprint
    Multimodal data, where different types of data are collected from the same subjects, are fast emerging in a large variety of scientific applications. Factor analysis is commonly employed in integrative analysis of multimodal data, and is particularly useful to overcome the curse of high dimensionality and high correlations of multi-modal data. However, there is little work on statistical inference for factor analysis based supervised modeling of multimodal data. In this article, we consider an integrative linear regression model that is built upon the latent factors extracted from multimodal data. We address three important questions: how to infer the significance of one data modality given the other modalities in the model; how to infer the significance of a combination of variables from one modality or across different modalities; and how to quantify the contribution, measured by the goodness-of-fit, of one data modality given the others. When answering each question, we explicitly characterize both the benefit and the extra cost of factor analysis. Those questions, to our knowledge, have not yet been addressed despite wide use of factor analysis in integrative multimodal analysis, and our proposal thus bridges an important gap. We study the empirical performance of our methods through simulations, and further illustrate with a multimodal neuroimaging analysis.
  • Article
    Importance Current guidelines recommend direct oral anticoagulants (DOACs) over warfarin for stroke prevention in patients with atrial fibrillation (AF) who are at high risk. Despite demonstrated efficacy in clinical trials, real-world data of DOACs vs warfarin for secondary prevention in patients with ischemic stroke are largely based on administrative claims or have not focused on patient-centered outcomes. Objective To examine the clinical effectiveness of DOACs (dabigatran, rivaroxaban, or apixaban) vs warfarin after ischemic stroke in patients with AF. Design, Setting, and Participants This cohort study included patients who were 65 years or older, had AF, were anticoagulation naive, and were discharged from 1041 Get With The Guidelines–Stroke–associated hospitals for acute ischemic stroke between October 2011 and December 2014. Data were linked to Medicare claims for long-term outcomes (up to December 2015). Analyses were completed in July 2018. Exposures DOACs vs warfarin prescription at discharge. Main Outcomes and Measures The primary outcomes were home time, a patient-centered measure defined as the total number of days free from death and institutional care after discharge, and major adverse cardiovascular events. A propensity score–overlap weighting method was used to account for differences in observed characteristics between groups. Results Of 11 662 survivors of acute ischemic stroke (median [interquartile range] age, 80 [74-86] years), 4041 (34.7%) were discharged with DOACs and 7621 with warfarin. Except for National Institutes of Health Stroke Scale scores (median [interquartile range], 4 [1-9] vs 5 [2-11]), baseline characteristics were similar between groups. Patients discharged with DOACs (vs warfarin) had more days at home (mean [SD], 287.2 [114.7] vs 263.0 [127.3] days; adjusted difference, 15.6 [99% CI, 9.0-22.1] days) during the first year postdischarge and were less likely to experience major adverse cardiovascular events (adjusted hazard ratio [aHR], 0.89 [99% CI, 0.83-0.96]). Also, in patients receiving DOACs, there were fewer deaths (aHR, 0.88 [95% CI, 0.82-0.95]; P < .001), all-cause readmissions (aHR, 0.93 [95% CI, 0.88-0.97]; P = .003), cardiovascular readmissions (aHR, 0.92 [95% CI, 0.86-0.99]; P = .02), hemorrhagic strokes (aHR, 0.69 [95% CI, 0.50-0.95]; P = .02), and hospitalizations with bleeding (aHR, 0.89 [95% CI, 0.81-0.97]; P = .009) but a higher risk of gastrointestinal bleeding (aHR, 1.14 [95% CI, 1.01-1.30]; P = .03). Conclusions and Relevance In patients with acute ischemic stroke and AF, DOAC use at discharge was associated with better long-term outcomes relative to warfarin.
  • Article
    Full-text available
    We describe studies in molecular profiling and biological pathway analysis that use sparse latent factor and regression models for microarray gene expression data. We discuss breast cancer applications and key aspects of the modeling and computational methodology. Our case studies aim to investigate and characterize heterogeneity of structure related to specific oncogenic pathways, as well as links between aggregate patterns in gene expression profiles and clinical biomarkers. Based on the metaphor of statistically derived "factors" as representing biological "subpathway" structure, we explore the decomposition of fitted sparse factor models into pathway subcomponents and investigate how these components overlay multiple aspects of known biological activity. Our methodology is based on sparsity modeling of multivariate regression, ANOVA, and latent factor models, as well as a class of models that combines all components. Hierarchical sparsity priors address questions of dimension reduction and multiple comparisons, as well as scalability of the methodology. The models include practically relevant non-Gaussian/nonparametric components for latent structure, underlying often quite complex non-Gaussianity in multivariate expression patterns. Model search and fitting are addressed through stochastic simulation and evolutionary stochastic search methods that are exemplified in the oncogenic pathway studies. Supplementary supporting material provides more details of the applications, as well as examples of the use of freely available software tools for implementing the methodology.
  • Article
    Antoniadis, A. and Fan, J. (2001). Regularized wavelet approximations (with discussion).
  • Article
    While most of the convergence results in the literature on high-dimensional covariance matrices are concerned with the accuracy of estimating the covariance matrix (and precision matrix), relatively less is known about the effect of estimating large covariances on statistical inferences. We study two important models: factor analysis and the panel data model with interactive effects, and focus on the statistical inference and estimation efficiency of structural parameters based on large covariance estimators. For efficient estimation, both models call for a weighted principal components (WPC) estimator, which relies on a high-dimensional weight matrix. This paper derives an efficient and feasible WPC using the covariance matrix estimator of Fan et al. (2013). However, we demonstrate that existing results on large covariance estimation based on absolute convergence are not suitable for statistical inferences of the structural parameters. What is needed is some weighted consistency and the associated rate of convergence, which are obtained in this paper. Finally, the proposed method is applied to the US divorce rate data. We find that the efficient WPC identifies the significant effects of divorce-law reforms on the divorce rate, and it provides more accurate estimation and tighter confidence intervals than existing methods.
  • Article
    Multiple-hypothesis testing involves guarding against much more complicated errors than single-hypothesis testing. Whereas we typically control the type I error rate for a single-hypothesis test, a compound error rate is controlled for multiple-hypothesis tests. For example, controlling the false discovery rate (FDR) traditionally involves intricate sequential p-value rejection methods based on the observed data. Whereas a sequential p-value method fixes the error rate and estimates its corresponding rejection region, we propose the opposite approach: we fix the rejection region and then estimate its corresponding error rate. This new approach offers increased applicability, accuracy and power. We apply the methodology to both the positive false discovery rate (pFDR) and FDR, and provide evidence for its benefits. It is shown that pFDR is probably the quantity of interest over FDR. Also discussed is the calculation of the q-value, the pFDR analogue of the p-value, which eliminates the need to set the error rate beforehand as is traditionally done. Some simple numerical examples are presented that show that this new approach can yield an increase of over eight times in power compared with the Benjamini-Hochberg FDR method.
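    As an illustration of the q-value idea described above, the sketch below computes q-values with the null proportion fixed at pi0 = 1 (the conservative Benjamini-Hochberg-style bound); Storey's method would instead estimate pi0 from the p-value distribution, so this is a simplified sketch rather than the paper's exact procedure.

```python
import numpy as np

def q_values(pvals, pi0=1.0):
    """Sketch of q-values: q_(i) = min_{j >= i} pi0 * m * p_(j) / j over the
    sorted p-values; pi0 = 1 gives the conservative BH-style bound."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = pi0 * m * p[order] / np.arange(1, m + 1)
    q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotonicity
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

print(q_values([0.001, 0.01, 0.03, 0.4, 0.8]))
```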
  • Article
    This paper deals with the estimation of a high-dimensional covariance with a conditional sparsity structure and fast-diverging eigenvalues. By assuming sparse error covariance matrix in an approximate factor model, we allow for the presence of some cross-sectional correlation even after taking out common but unobservable factors. We introduce the Principal Orthogonal complEment Thresholding (POET) method to explore such an approximate factor structure with sparsity. The POET estimator includes the sample covariance matrix, the factor-based covariance matrix (Fan, Fan, and Lv, 2008), the thresholding estimator (Bickel and Levina, 2008) and the adaptive thresholding estimator (Cai and Liu, 2011) as specific examples. We provide mathematical insights when the factor analysis is approximately the same as the principal component analysis for high-dimensional data. The rates of convergence of the sparse residual covariance matrix and the conditional sparse covariance matrix are studied under various norms. It is shown that the impact of estimating the unknown factors vanishes as the dimensionality increases. The uniform rates of convergence for the unobserved factors and their factor loadings are derived. The asymptotic results are also verified by extensive simulation studies. Finally, a real data application on portfolio allocation is presented.
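    A bare-bones sketch of the POET construction described above: keep the top-K principal components of the sample covariance and threshold the remaining "principal orthogonal complement". The constant threshold below is a placeholder for the paper's data-driven choice, so treat this as an illustration rather than the published estimator.

```python
import numpy as np

def poet(X, n_factors, thresh):
    """POET-style estimate (sketch): low-rank part from the top principal
    components plus a hard-thresholded residual covariance."""
    T, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / T
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1][:n_factors]
    low_rank = vecs[:, idx] @ np.diag(vals[idx]) @ vecs[:, idx].T
    R = S - low_rank                       # principal orthogonal complement
    R_thr = R * (np.abs(R) >= thresh)      # hard-threshold the off-factor covariance
    np.fill_diagonal(R_thr, np.diag(R))    # keep the diagonal untouched
    return low_rank + R_thr
```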
  • Article
    Full-text available
    This paper considers the factor model $X_t = \Lambda F_t + e_t$. Assuming a normal distribution for the idiosyncratic error $e_t$ conditional on the factors $\{F_t\}$, conditional maximum likelihood estimators of the factor and factor-loading spaces are derived. Without the normality assumption, these estimators are called generalized principal component estimators (GPCEs). This paper derives asymptotic distributions of the GPCEs of the factor and factor-loading spaces. It is shown that the variance of the GPCE of the common component is smaller than that of the principal component estimator studied in Bai (2003, Econometrica 71, 135–172). The approximate variance of the forecasting error using the GPCE-based factor estimates is derived and shown to be smaller than that based on the principal component estimator. The feasible GPCE (FGPCE) of the factor space is shown to be asymptotically equivalent to the GPCE. The GPCE and FGPCE are shown to be more efficient than the principal component estimator in finite samples.
  • Article
    Full-text available
    This paper proposes two new estimators for determining the number of factors (r) in approximate factor models. We exploit the well-known fact that the r largest eigenvalues of the variance matrix of N response variables grow unboundedly as N increases, while the other eigenvalues remain bounded. The new estimators are obtained simply by maximizing the ratio of two adjacent eigenvalues. Bai and Ng (2002) and Onatski (2006) developed methods by which the number of factors can be estimated by comparing the eigenvalues with prespecified or estimated threshold values. Asymptotically, any scalar multiple of a valid threshold value is also valid. However, the finite-sample properties of the estimators depend on the choice of the thresholds. The estimators we propose do not require the use of threshold values. Our simulation results show that the new estimators have good finite-sample properties unless the signal-to-noise ratios of some factors are too low.
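    The eigenvalue-ratio rule lends itself to a very short implementation; the sketch below (with an illustrative cap k_max, which should stay well below min(T, p) so the denominators are nonzero) is a generic rendering of the idea, not the authors' code.

```python
import numpy as np

def n_factors_eigen_ratio(X, k_max=10):
    """Pick the number of factors as the k maximizing lambda_k / lambda_{k+1}
    among the ordered eigenvalues of the sample covariance (sketch)."""
    T, p = X.shape
    Xc = X - X.mean(axis=0)
    vals = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / T))[::-1]
    ratios = vals[:k_max] / vals[1:k_max + 1]
    return int(np.argmax(ratios)) + 1
```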
  • Article
    Full-text available
    This paper deals with factor modeling for high-dimensional time series from a dimension-reduction viewpoint. Under stationary settings, the inference is simple in the sense that both the number of factors and the factor loadings are estimated in terms of an eigenanalysis of a nonnegative definite matrix, and is therefore applicable when the dimension of the time series is on the order of a few thousand. Asymptotic properties of the proposed method are investigated under two settings: (i) the sample size goes to infinity while the dimension of the time series is fixed; and (ii) both the sample size and the dimension of the time series go to infinity together. In particular, our estimators of the zero eigenvalues enjoy faster convergence (or slower divergence) rates, making the estimation of the number of factors easier. When the sample size and the dimension of the time series go to infinity together, the estimators of the eigenvalues are no longer consistent; however, our estimator of the number of factors, which is based on the ratios of the estimated eigenvalues, still works well. Furthermore, this estimation exhibits the so-called "blessing of dimensionality" property in the sense that its performance may improve as the dimension of the time series increases. A two-step procedure is investigated when the factors are of different degrees of strength. Numerical illustration with both simulated and real data is also reported.
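    The abstract does not spell out the nonnegative definite matrix used for the eigenanalysis; a common construction in this line of work accumulates lagged sample autocovariances, and the sketch below follows that convention. Both the lag cap k0 and the ratio-based choice of the number of factors are illustrative assumptions, not a transcription of the paper.

```python
import numpy as np

def ts_factor_eigenanalysis(Y, k0=2, k_max=10):
    """Sketch: form M = sum_{k=1..k0} Sigma_y(k) Sigma_y(k)^T from lagged
    autocovariances, estimate the loading space from its leading eigenvectors,
    and pick the number of factors by an eigenvalue-ratio rule."""
    T, p = Y.shape
    Yc = Y - Y.mean(axis=0)
    M = np.zeros((p, p))
    for k in range(1, k0 + 1):
        Sk = Yc[k:].T @ Yc[:-k] / (T - k)   # lag-k sample autocovariance
        M += Sk @ Sk.T                      # nonnegative definite by construction
    vals, vecs = np.linalg.eigh(M)
    vals, vecs = vals[::-1], vecs[:, ::-1]
    r_hat = int(np.argmax(vals[:k_max] / vals[1:k_max + 1])) + 1
    loadings = vecs[:, :r_hat]              # estimated loading space
    return r_hat, loadings, Yc @ loadings   # and the estimated factor process
```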
  • Article
    Full-text available
    The variance-covariance matrix plays a central role in the inferential theories of high-dimensional factor models in finance and economics. Popular regularization methods that directly exploit sparsity are not applicable to many financial problems. Classical methods of estimating the covariance matrix are based on strict factor models, which assume independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming a sparse error covariance matrix, we allow for the presence of cross-sectional correlation even after taking out common factors, which enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu (2011), taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on the covariance matrix estimation based on the factor structure is then studied.
  • Article
    This paper considers the maximum likelihood estimation of factor models of high dimension, where the number of variables (N) is comparable with or even greater than the number of observations (T). An inferential theory is developed. We establish not only consistency but also the rate of convergence and the limiting distributions. Five different sets of identification conditions are considered. We show that the distributions of the MLE estimators depend on the identification restrictions. Unlike the principal components approach, the maximum likelihood estimator explicitly allows heteroskedasticities, which are jointly estimated with other parameters. Efficiency of MLE relative to the principal components method is also considered.
  • Article
    High dimensionality comparable to sample size is common in many statistical problems. We examine covariance matrix estimation in the asymptotic framework that the dimensionality p tends to ∞ as the sample size n increases. Motivated by the Arbitrage Pricing Theory in finance, a multi-factor model is employed to reduce dimensionality and to estimate the covariance matrix. The factors are observable and the number of factors K is allowed to grow with p. We investigate the impact of p and K on the performance of the model-based covariance matrix estimator. Under mild assumptions, we have established convergence rates and asymptotic normality of the model-based estimator. Its performance is compared with that of the sample covariance matrix. We identify situations under which the factor approach increases performance substantially or marginally. The impacts of covariance matrix estimation on optimal portfolio allocation and portfolio risk assessment are studied. The asymptotic results are supported by a thorough simulation study.
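    With observable factors, the model-based covariance estimator described above reduces to a loadings-times-factor-covariance term plus a diagonal residual term; the sketch below is a generic rendering of that construction under an exact (strict) factor model, not the paper's exact estimator.

```python
import numpy as np

def observed_factor_cov(X, F):
    """Sketch of a covariance estimate from observed factors:
    X_t = B f_t + eps_t, Sigma_hat = B_hat Cov(f) B_hat' + diag(residual variances)."""
    T, p = X.shape
    Xc = X - X.mean(axis=0)
    Fc = F - F.mean(axis=0)
    B_hat = np.linalg.lstsq(Fc, Xc, rcond=None)[0].T    # p x K least-squares loadings
    resid = Xc - Fc @ B_hat.T
    Sigma_f = Fc.T @ Fc / T
    return B_hat @ Sigma_f @ B_hat.T + np.diag(resid.var(axis=0))
```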
  • Article
    This paper identifies five common risk factors in the returns on stocks and bonds. There are three stock-market factors: an overall market factor and factors related to firm size and book-to-market equity. There are two bond-market factors, related to maturity and default risks. Stock returns have shared variation due to the stock-market factors, and they are linked to bond returns through shared variation in the bond-market factors. Except for low-grade corporates, the bond-market factors capture the common variation in bond returns. Most important, the five factors seem to explain average returns on stocks and bonds.
  • Article
    Full-text available
    We hereby propose a novel approach to the identification of ischemic stroke (IS) susceptibility genes that involves converging data from several unbiased genetic and genomic tools. We tested the association between IS and genes differentially expressed between cases and controls, then determined which data mapped to previously reported linkage peaks and were nominally associated with stroke in published genome-wide association studies. We first performed gene expression profiling in peripheral blood mononuclear cells of 20 IS cases and 20 controls. Sixteen differentially expressed genes mapped to reported whole-genome linkage peaks, including the TTC7B gene, which has been associated with major cardiovascular disease. At the TTC7B locus, 46 tagging polymorphisms were tested for association in 565 Portuguese IS cases and 520 controls. Markers nominally associated in at least one test and defining associated haplotypes were then examined in 570 IS Spanish cases and 390 controls. Several polymorphisms and haplotypes in the intron 5-intron 6 region of TTC7B were also associated with IS risk in the Spanish and combined data sets. Multiple independent lines of evidence therefore support the role of TTC7B in stroke susceptibility, but further work is warranted to identify the exact risk variant and its pathogenic potential.
  • Article
    Full-text available
    This paper studies the sparsistency and rates of convergence for estimating sparse covariance and precision matrices based on penalized likelihood with nonconvex penalty functions. Here, sparsistency refers to the property that all parameters that are zero are actually estimated as zero with probability tending to one. Depending on the application, sparsity may occur a priori in the covariance matrix, its inverse or its Cholesky decomposition. We study these three sparsity exploration problems under a unified framework with a general penalty function. We show that the rates of convergence for these problems under the Frobenius norm are of order $(s_n \log p_n/n)^{1/2}$, where $s_n$ is the number of nonzero elements, $p_n$ is the size of the covariance matrix and $n$ is the sample size. This explicitly spells out that the contribution of high-dimensionality is merely a logarithmic factor. The conditions on the rate at which the tuning parameter $\lambda_n$ goes to 0 are made explicit and compared under different penalties. As a result, for the $L_1$-penalty, to guarantee sparsistency and the optimal rate of convergence, the number of nonzero elements should be small: $s_n' = O(p_n)$ at most, among $O(p_n^2)$ parameters, for estimating a sparse covariance or correlation matrix, a sparse precision or inverse correlation matrix, or a sparse Cholesky factor, where $s_n'$ is the number of nonzero elements on the off-diagonal entries. On the other hand, with the SCAD or hard-thresholding penalty functions, there is no such restriction.
  • Article
    In this paper we consider estimation of sparse covariance matrices and propose a thresholding procedure which is adaptive to the variability of individual entries. The estimators are fully data driven and enjoy excellent performance both theoretically and numerically. It is shown that the estimators adaptively achieve the optimal rate of convergence over a large class of sparse covariance matrices under the spectral norm. In contrast, the commonly used universal thresholding estimators are shown to be sub-optimal over the same parameter spaces. Support recovery is also discussed. The adaptive thresholding estimators are easy to implement. Numerical performance of the estimators is studied using both simulated and real data. Simulation results show that the adaptive thresholding estimators uniformly outperform the universal thresholding estimators. The method is also illustrated in an analysis on a dataset from a small round blue-cell tumors microarray experiment. A supplement to this paper which contains additional technical proofs is available online.
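    A sketch of entry-adaptive thresholding: each off-diagonal entry of the sample covariance gets its own threshold proportional to an estimate of its variability. The constant delta is user-chosen here (the paper tunes it), and the hard-thresholding rule stands in for the general class of thresholding functions considered.

```python
import numpy as np

def adaptive_threshold_cov(X, delta=2.0):
    """Adaptive thresholding (sketch): lambda_ij = delta * sqrt(theta_ij * log(p) / n),
    where theta_ij estimates Var((X_i - mu_i)(X_j - mu_j)); diagonal left untouched."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    theta = np.einsum('ti,tj->ij', Xc**2, Xc**2) / n - S**2   # entrywise variability
    lam = delta * np.sqrt(np.maximum(theta, 0.0) * np.log(p) / n)
    S_thr = S * (np.abs(S) >= lam)
    np.fill_diagonal(S_thr, np.diag(S))
    return S_thr
```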
  • Article
    Full-text available
    Covariance matrices play a central role in multivariate statistical analysis. Significant advances have been made recently in developing both theory and methodology for estimating large covariance matrices. However, a minimax theory has yet to be developed. In this paper we establish the optimal rates of convergence for estimating the covariance matrix under both the operator norm and the Frobenius norm. It is shown that the optimal procedures under the two norms are different, and consequently matrix estimation under the operator norm is fundamentally different from vector estimation. The minimax upper bound is obtained by constructing a special class of tapering estimators and by studying their risk properties. A key step in obtaining the optimal rate of convergence is the derivation of the minimax lower bound. The technical analysis requires new ideas that are quite different from those used in the more conventional function/sequence estimation problems.
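    As an illustration of the tapering estimators used for the upper bound, the sketch below applies the usual bandwidth-k weights (flat within k/2, linearly decaying to zero at k) entrywise to the sample covariance; it assumes the variables carry a natural ordering and is not tied to this paper's exact construction.

```python
import numpy as np

def taper_cov(X, k):
    """Tapering estimator (sketch): entrywise weights equal 1 for |i-j| <= k/2,
    decay linearly to 0 at |i-j| = k, and vanish beyond."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    dist = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    W = np.clip(2.0 - dist / (k / 2.0), 0.0, 1.0)
    return S * W
```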
  • Article
    Full-text available
    We consider the estimation of integrated covariance matrices of high-dimensional diffusion processes using high-frequency data. We start by studying the most commonly used estimator, the realized covariance matrix (RCV). We show that in the high-dimensional case, when the dimension p and the observation frequency n grow at the same rate, the limiting empirical spectral distribution of the RCV depends on the covolatility processes not only through the underlying integrated covariance matrix Sigma, but also through how the covolatility processes vary in time. In particular, for two high-dimensional diffusion processes with the same integrated covariance matrix, the empirical spectral distributions of their RCVs can be very different. Hence, in terms of making inference about the spectrum of the integrated covariance matrix, the RCV is in general not a good proxy to rely on in the high-dimensional case. We then propose an alternative estimator, the time-variation adjusted realized covariance matrix (TVARCV), for a class of diffusion processes. We show that the limiting empirical spectral distribution of the TVARCV depends solely on that of Sigma through a Marcenko-Pastur equation; hence the TVARCV can be used to recover the empirical spectral distribution of Sigma by inverting the Marcenko-Pastur equation, which can then be applied to further applications such as portfolio allocation and risk management.
  • Article
    Full-text available
    We propose MC+, a fast, continuous, nearly unbiased and accurate method of penalized variable selection in high-dimensional linear regression. The LASSO is fast and continuous, but biased. The bias of the LASSO may prevent consistent variable selection. Subset selection is unbiased but computationally costly. The MC+ has two elements: a minimax concave penalty (MCP) and a penalized linear unbiased selection (PLUS) algorithm. The MCP provides the convexity of the penalized loss in sparse regions to the greatest extent given certain thresholds for variable selection and unbiasedness. The PLUS computes multiple exact local minimizers of a possibly nonconvex penalized loss function in a certain main branch of the graph of critical points of the penalized loss. Its output is a continuous piecewise linear path encompassing from the origin for infinite penalty to a least squares solution for zero penalty. We prove that at a universal penalty level, the MC+ has high probability of matching the signs of the unknowns, and thus correct selection, without assuming the strong irrepresentable condition required by the LASSO. This selection consistency applies to the case of $p\gg n$, and is proved to hold for exactly the MC+ solution among possibly many local minimizers. We prove that the MC+ attains certain minimax convergence rates in probability for the estimation of regression coefficients in $\ell_r$ balls. We use the SURE method to derive degrees of freedom and $C_p$-type risk estimates for general penalized LSE, including the LASSO and MC+ estimators, and prove their unbiasedness. Based on the estimated degrees of freedom, we propose an estimator of the noise level for proper choice of the penalty level.
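    For reference, the MCP and its univariate thresholding rule (for an orthonormal design) can be written down directly; the sketch below shows the penalty only, not the PLUS path algorithm.

```python
import numpy as np

def mcp_penalty(beta, lam, gamma=3.0):
    """Minimax concave penalty: L1-like near zero, flat for |beta| >= gamma*lam,
    so large coefficients are essentially unbiased."""
    b = np.abs(beta)
    return np.where(b <= gamma * lam, lam * b - b**2 / (2 * gamma), 0.5 * gamma * lam**2)

def mcp_threshold(z, lam, gamma=3.0):
    """Univariate MCP solution under an orthonormal design (gamma > 1):
    rescaled soft-thresholding below gamma*lam, identity above."""
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    return np.where(np.abs(z) <= gamma * lam, soft / (1.0 - 1.0 / gamma), z)
```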
  • Article
    The paradigm of a factor model is very appealing and has been used extensively in economic analyses. Underlying the factor model is the idea that a large number of economic variables can be adequately modelled by a small number of indicator variables. Throughout this extensive research activity on large dimensional factor models, a major preoccupation has been the development of tools for determining the number of factors needed for modelling. This paper builds on the work of Kapetanios (2004) to provide an alternative to information criteria as a tool for estimating the number of factors in large dimensional factor models. The new method is robust to considerable cross-sectional and temporal dependence. The theoretical properties of the method are explored and an extensive Monte Carlo study is undertaken. Results are favourable for the new method and suggest that it is a reasonable alternative to existing methods.
  • Article
    This paper considers regularizing a covariance matrix of $p$ variables estimated from $n$ observations, by hard thresholding. We show that the thresholded estimate is consistent in the operator norm as long as the true covariance matrix is sparse in a suitable sense, the variables are Gaussian or sub-Gaussian, and $(\log p)/n\to0$, and obtain explicit rates. The results are uniform over families of covariance matrices which satisfy a fairly natural notion of sparsity. We discuss an intuitive resampling scheme for threshold selection and prove a general cross-validation result that justifies this approach. We also compare thresholding to other covariance estimators in simulations and on an example from climate data.
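    The universal rule is even simpler than the adaptive one sketched earlier: every off-diagonal entry is compared against a single threshold of order sqrt(log(p)/n). The constant c below is illustrative; the paper selects the threshold by a resampling scheme.

```python
import numpy as np

def hard_threshold_cov(X, c=1.0):
    """Universal hard thresholding of the sample covariance (sketch)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n
    lam = c * np.sqrt(np.log(p) / n)
    S_thr = S * (np.abs(S) >= lam)
    np.fill_diagonal(S_thr, np.diag(S))
    return S_thr
```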
  • Article
    Full-text available
    Kyoto Encyclopedia of Genes and Genomes (KEGG) is a knowledge base for systematic analysis of gene functions in terms of the networks of genes and molecules. The major component of KEGG is the PATHWAY database that consists of graphical diagrams of biochemical pathways including most of the known metabolic pathways and some of the known regulatory pathways. The pathway information is also represented by the ortholog group tables summarizing orthologous and paralogous gene groups among different organisms. KEGG maintains the GENES database for the gene catalogs of all organisms with complete genomes and selected organisms with partial genomes, which are continuously re-annotated, as well as the LIGAND database for chemical compounds and enzymes. Each gene catalog is associated with the graphical genome map for chromosomal locations that is represented by Java applet. In addition to the data collection efforts, KEGG develops and provides various computational tools, such as for reconstructing biochemical pathways from the complete genome sequence and for predicting gene regulatory networks from the gene expression profiles. The KEGG databases are daily updated and made freely available (http://www.genome.ad.jp/kegg/).
  • Article
    Full-text available
    Expression array data are used to predict biological functions of uncharacterized genes by comparing their expression profiles to those of characterized genes. While biologically plausible, this is both statistically and computationally challenging. Typical approaches are computationally expensive and ignore correlations among expression profiles and functional categories. We propose a factor analysis model (FAM) for functional genomics and give a two-step algorithm, using genome-wide expression data for yeast and a subset of Gene-Ontology Biological Process functional annotations. We show that the predictive performance of our method is comparable to that of the current best approach while our total computation time is lower by a factor of 4000. We discuss the unique challenges in performance evaluation of algorithms used for genome-wide functional genomics. Finally, we discuss extensions to our method that can incorporate the inherent correlation structure of the functional categories to further improve predictive performance. Our factor analysis model is a computationally efficient technique for functional genomics and provides a clear and unified statistical framework with potential for incorporating important gene ontology information to improve predictions.
  • Article
    This paper develops an inferential theory for factor models of large dimensions. The principal components estimator is considered because it is easy to compute and is asymptotically equivalent to the maximum likelihood estimator (if normality is assumed). We derive the rate of convergence and the limiting distributions of the estimated factors, factor loadings, and common components. The theory is developed within the framework of large cross sections (N) and a large time dimension (T), to which classical factor analysis does not apply. We show that the estimated common components are asymptotically normal with a convergence rate equal to the minimum of the square roots of N and T. The estimated factors and their loadings are generally normal, although not always so. The convergence rate of the estimated factors and factor loadings can be faster than that of the estimated common components. These results are obtained under general conditions that allow for correlations and heteroskedasticities in both dimensions. Stronger results are obtained when the idiosyncratic errors are serially uncorrelated and homoskedastic. A necessary and sufficient condition for consistency is derived for large N but fixed T.
  • Article
    This article considers forecasting a single time series when there are many predictors (N) and time series observations (T). When the data follow an approximate factor model, the predictors can be summarized by a small number of indexes, which we estimate using principal components. Feasible forecasts are shown to be asymptotically efficient in the sense that the difference between the feasible forecasts and the infeasible forecasts constructed using the actual values of the factors converges in probability to 0 as both N and T grow large. The estimated factors are shown to be consistent, even in the presence of time variation in the factor model.
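    A compressed sketch of the diffusion-index forecast: extract principal-component factors from the predictor panel, regress the one-step-ahead target on the current factors, and forecast from the latest factor values. The single-lag specification and standardization choices below are simplifying assumptions.

```python
import numpy as np

def diffusion_index_forecast(y, X, n_factors):
    """Forecast y_{T+1} from principal-component factor estimates (sketch)."""
    T, N = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    F = Xs @ Vt[:n_factors].T                        # T x K estimated factors
    Z = np.column_stack([np.ones(T - 1), F[:-1]])    # regress y_{t+1} on F_t
    coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
    return np.concatenate([[1.0], F[-1]]) @ coef     # point forecast of y_{T+1}
```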
  • Article
    Variable selection is fundamental to high-dimensional statistical modeling, including nonparametric regression. Many approaches in use are stepwise selection procedures, which can be computationally expensive and ignore stochastic errors in the variable selection process. In this article, penalized likelihood approaches are proposed to handle these kinds of problems. The proposed methods select variables and estimate coefficients simultaneously. Hence they enable us to construct confidence intervals for estimated parameters. The proposed approaches are distinguished from others in that the penalty functions are symmetric, nonconcave on (0, ∞), and have singularities at the origin to produce sparse solutions. Furthermore, the penalty functions should be bounded by a constant to reduce bias and satisfy certain conditions to yield continuous solutions. A new algorithm is proposed for optimizing penalized likelihood functions. The proposed ideas are widely applicable. They are readily applied to a variety of parametric models such as generalized linear models and robust regression models. They can also be applied easily to nonparametric modeling by using wavelets and splines. Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known. Our simulation shows that the newly proposed methods compare favorably with other variable selection techniques. Furthermore, the standard error formulas are tested to be accurate enough for practical applications.
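    The SCAD penalty referred to above has a closed form, reproduced below with a = 3.7 (the value suggested in the paper); the function itself is standard, though the snippet is only an illustration, not the paper's estimation algorithm.

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty: L1 near zero, quadratic transition on (lam, a*lam],
    constant beyond, so large coefficients are hardly shrunk."""
    b = np.abs(beta)
    linear = lam * b
    quad = (2.0 * a * lam * b - b**2 - lam**2) / (2.0 * (a - 1.0))
    const = (a + 1.0) * lam**2 / 2.0
    return np.where(b <= lam, linear, np.where(b <= a * lam, quad, const))
```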
  • Article
    In this paper we develop some econometric theory for factor models of large dimensions. The focus is the determination of the number of factors (r), which is an unresolved issue in the rapidly growing literature on multifactor models. We first establish the convergence rate for the factor estimates that will allow for consistent estimation of r. We then propose some panel criteria and show that the number of factors can be consistently estimated using the criteria. The theory is developed under the framework of large cross-sections (N) and large time dimensions (T). No restriction is imposed on the relation between N and T. Simulations show that the proposed criteria have good finite sample properties in many configurations of the panel data encountered in practice.
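    One of the panel criteria can be sketched in a few lines: for each candidate k, compute the mean squared residual V(k) after removing k principal components and penalize by a term that grows with k. The penalty shown is the IC_p2-style term; treat the snippet as an illustration of the approach rather than a faithful reproduction of all the proposed criteria.

```python
import numpy as np

def n_factors_ic(X, k_max=10):
    """Choose k minimizing log(V(k)) + k * ((N+T)/(NT)) * log(min(N, T)),
    where V(k) is the mean squared residual after k principal components (sketch)."""
    T, N = X.shape
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    penalty = (N + T) / (N * T) * np.log(min(N, T))
    ics = []
    for k in range(1, k_max + 1):
        common = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]   # rank-k common component
        V_k = np.mean((Xc - common) ** 2)
        ics.append(np.log(V_k) + k * penalty)
    return int(np.argmin(ics)) + 1
```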
  • Article
    Full-text available
    The paper proposes a method for constructing a sparse estimator for the inverse covariance (concentration) matrix in high-dimensional settings. The estimator uses a penalized normal likelihood approach and forces sparsity by using a lasso-type penalty. We establish a rate of convergence in the Frobenius norm as both data dimension $p$ and sample size $n$ are allowed to grow, and show that the rate depends explicitly on how sparse the true concentration matrix is. We also show that a correlation-based version of the method exhibits better rates in the operator norm. We also derive a fast iterative algorithm for computing the estimator, which relies on the popular Cholesky decomposition of the inverse but produces a permutation-invariant estimator. The method is compared to other estimators on simulated data and on a real data example of tumor tissue classification using gene expression data.
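    The paper's algorithm is Cholesky-based, but the same lasso-penalized Gaussian likelihood idea is conveniently illustrated with scikit-learn's GraphicalLasso, which solves a closely related l1-penalized objective; the snippet below is therefore an illustration of the general approach, not the estimator studied in the paper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# l1-penalized Gaussian likelihood for a sparse concentration (precision) matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 30))          # toy data: n = 200, p = 30
model = GraphicalLasso(alpha=0.1).fit(X)    # alpha is the l1 penalty level
Omega_hat = model.precision_                # sparse concentration matrix estimate
Sigma_hat = model.covariance_
```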