We propose a high-dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called feature augmentation via nonparametrics and selection (FANS). We motivate FANS by generalizing the naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are
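The two-stage FANS idea, transforming each feature by an estimated marginal log density ratio and then fitting a penalized logistic regression, can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the Gaussian kernel density estimator, the small floor `eps`, and the reuse of the same sample for density estimation and regression are simplifying assumptions (the paper would use sample splitting).

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 400, 5
X = rng.normal(size=(n, p))
# class label depends on feature 0 only; the rest are noise features
y = (rng.random(n) < 1 / (1 + np.exp(-2 * X[:, 0]))).astype(int)

def augment(X_fit, y_fit, X_new, eps=1e-3):
    """Map each feature to an estimated log marginal density ratio log(f1/f0)."""
    Z = np.empty_like(X_new)
    for j in range(X_new.shape[1]):
        f1 = gaussian_kde(X_fit[y_fit == 1, j])
        f0 = gaussian_kde(X_fit[y_fit == 0, j])
        Z[:, j] = np.log(f1(X_new[:, j]) + eps) - np.log(f0(X_new[:, j]) + eps)
    return Z

# feature augmentation, then selection via an L1-penalized logistic regression
Z = augment(X, y, X)
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(Z, y)
acc = clf.score(Z, y)
```

The L1 penalty plays the "global simplicity" role (selecting few augmented features), while the nonparametric density ratios supply the "local complexity".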
Big data is transforming our world, revolutionizing operations and analytics everywhere, from financial engineering to biomedical sciences. The complexity of big data often makes dimension reduction techniques necessary before conducting statistical inference. Principal component analysis, commonly referred to as PCA, has become an essential tool for multivariate data analysis and unsupervised dimension reduction, the goal of which is to find a lower-dimensional subspace that captures most of the variation in the dataset. This article provides an overview of methodological and theoretical developments of PCA over the past decade, with a focus on its applications to big data analytics. We first review the mathematical formulation of PCA and its theoretical development from the viewpoint of perturbation analysis. We then briefly discuss the relationship between PCA and factor analysis as well as its applications to
Problems of nonparametric filtering arise frequently in engineering and financial economics. Nonparametric filters often involve some filtering parameters to choose. These parameters can be chosen to optimize the performance locally at each time point or globally over a time interval. In this article, the filtering parameters are chosen by minimizing the prediction error for a large class of filters. Under a general martingale setting, with mild conditions on the time series structure and virtually no assumption on the filters, we show that the adaptive filter with filtering parameter chosen by historical data performs nearly as well as the one with the ideal filter in the class, in terms of filtering errors. The theoretical result is also verified via intensive simulations. Our approach is also useful for choosing the orders of parametric models such as AR or GARCH processes, and it can be applied to volatility estimation in financial economics. We illustrate the proposed methods by estimating the volatility of the returns of the S&P 500 index and the yields of three-month Treasury bills.
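Choosing a filtering parameter by minimizing historical prediction error can be illustrated with a toy family of exponential-smoothing filters. The filter family, the parameter grid, and the AR(1) data below are illustrative assumptions, not the paper's setting:

```python
import numpy as np

def exp_smooth_predict(y, lam):
    """One-step-ahead predictions from exponential smoothing with parameter lam."""
    pred = np.empty_like(y)
    pred[0] = y[0]
    for t in range(1, len(y)):
        # only past observations enter the prediction at time t
        pred[t] = lam * y[t - 1] + (1 - lam) * pred[t - 1]
    return pred

def choose_lambda(y, grid):
    """Pick the filtering parameter minimizing the historical prediction error."""
    errs = [np.mean((y[1:] - exp_smooth_predict(y, lam)[1:]) ** 2) for lam in grid]
    return grid[int(np.argmin(errs))]

rng = np.random.default_rng(1)
# simulate an AR(1) series as the time series to be filtered
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.3 * y[t - 1] + rng.normal()

grid = np.linspace(0.05, 1.0, 20)
best = choose_lambda(y, grid)
```

By construction the data-driven choice does at least as well, in historical prediction error, as any fixed parameter on the grid, which is the spirit of the adaptivity result.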
We propose several statistics to test the Markov hypothesis for -mixing stationary processes sampled at discrete time intervals. Our tests are based on the Chapman-Kolmogorov equation. We establish the asymptotic null distributions of the proposed test statistics, showing that Wilks's phenomenon holds. We compute the power of the test and provide simulations to investigate the finite sample performance of the test statistics when the null model is a diffusion process, with alternatives consisting of models with a stochastic mean reversion level, stochastic volatility, and jumps.
Multiple hypothesis testing is a fundamental problem in high-dimensional inference, with wide applications in many scientific fields. In genome-wide association studies, tens of thousands of tests are performed simultaneously to find whether any genes are associated with some traits, and those tests are correlated. When test statistics are correlated, false discovery control becomes very challenging under arbitrary dependence. In the current paper, we propose a new methodology based on principal factor approximation, which successfully subtracts the common dependence and significantly weakens the correlation structure, to deal with an arbitrary dependence structure. We derive the theoretical distribution of the false discovery proportion (FDP) in large-scale multiple testing when a common threshold is used, and provide a consistent estimate of the FDP. This result has important applications in controlling the FDR and FDP. Our estimate of the FDP compares favorably with the approach of Efron (2007), as demonstrated in simulated examples. Our approach is further illustrated by some real data applications.
Principal component analysis (PCA) is fundamental to statistical machine learning. It extracts latent principal factors that contribute most to the variation of the data. When data are stored across multiple machines, however, communication cost can prohibit the computation of PCA in a central location, and distributed algorithms for PCA are thus needed. This paper proposes and studies a distributed PCA algorithm: each node machine computes the top K eigenvectors and transmits them to the central server; the central server then aggregates the information from all the node machines and conducts a PCA based on the aggregated information. We investigate the bias and variance of the resulting distributed estimator of the top K eigenvectors. In particular, we show that for distributions with symmetric innovation, the empirical top eigenspaces are unbiased, and hence the distributed PCA is unbiased. We derive the
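A one-shot aggregation scheme of this kind can be sketched as follows, on a synthetic spiked-covariance model. The variable names, the averaging of projection matrices, and the spike strength are illustrative assumptions, not the paper's code:

```python
import numpy as np

def local_top_k(X, k):
    """Top-k eigenvectors of a node machine's local sample covariance."""
    S = X.T @ X / X.shape[0]
    vals, vecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    return vecs[:, -k:]                  # columns = top-k eigenvectors

def distributed_pca(chunks, k):
    """Central server: average node-wise projection matrices V V', then take
    the top-k eigenvectors of the average (one round of communication)."""
    p = chunks[0].shape[1]
    avg = np.zeros((p, p))
    for X in chunks:
        V = local_top_k(X, k)
        avg += V @ V.T
    avg /= len(chunks)
    vals, vecs = np.linalg.eigh(avg)
    return vecs[:, -k:]

rng = np.random.default_rng(2)
p, k = 10, 2
U = np.linalg.qr(rng.normal(size=(p, k)))[0]   # true top-k eigenspace

def sample(n):
    # spiked model: strong signal along U plus isotropic noise
    return rng.normal(size=(n, k)) * np.sqrt(20.0) @ U.T + rng.normal(size=(n, p))

chunks = [sample(300) for _ in range(5)]       # data split across 5 machines
V_hat = distributed_pca(chunks, k)
align = np.linalg.norm(U.T @ V_hat)            # near sqrt(k) when subspaces match
```

Each machine ships only a p-by-k matrix, so the communication cost is independent of the local sample sizes.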
We consider forecasting a single time series when there is a large number of predictors and a possible nonlinear effect. The dimensionality is first reduced via a high-dimensional factor model implemented by principal component analysis. Using the extracted factors, we develop a novel forecasting method called the sufficient forecasting, which provides a set of sufficient predictive indices, inferred from high-dimensional predictors, to deliver additional predictive power. Projected principal component analysis is employed to enhance the accuracy of the inferred factors when a semiparametric factor model is assumed. Our method is also applicable to cross-sectional sufficient regression using extracted factors. The connection between the sufficient forecasting and the deep learning architecture is explicitly stated. The sufficient forecasting correctly estimates projection indices of the underlying factors even
We study the estimation of the additive components in additive regression models, based on the weighted sample average of the regression surface, for stationary -mixing processes. The explicit expression of this method allows fast computation and an asymptotic analysis. The estimation procedure is especially useful for additive modeling. In this paper, it is shown that the average surface estimator shares the same optimality as the ideal estimator and has the same ability to estimate an additive component as in the ideal case where the other components are known. Formulas for the asymptotic bias and the asymptotic normality of the estimator are established. A small simulation study is carried out to illustrate the performance of the estimation, and a real example is also used to demonstrate our methodology.
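The idea of recovering an additive component by averaging a fitted regression surface over the other covariates can be sketched in a toy marginal-averaging form. The Nadaraya-Watson surface estimator, bandwidth, and i.i.d. data below are simplifying assumptions; the paper's weighted average and dependence setting are more refined:

```python
import numpy as np

def nw_surface(X, y, x_eval, h):
    """Nadaraya-Watson estimate of the 2-d regression surface at x_eval."""
    d2 = np.sum(((X - x_eval) / h) ** 2, axis=1)
    w = np.exp(-0.5 * d2)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(8)
n = 2000
X = rng.uniform(-1, 1, size=(n, 2))
# additive truth: m(x1, x2) = sin(pi x1) + x2^2
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=n)

def additive_component(x1, h=0.15, m=200):
    """Average the fitted surface over observed values of the other coordinate."""
    pts = np.column_stack([np.full(m, x1), X[:m, 1]])
    return np.mean([nw_surface(X, y, p, h) for p in pts])

# at x1 = 0.5 the average should be near sin(pi/2) + E[x2^2] = 1 + 1/3
g_half = additive_component(0.5)
```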
We propose a novel Rayleigh quotient based sparse quadratic dimension reduction method, named QUADRO (Quadratic Dimension Reduction via Rayleigh Optimization), for analyzing high-dimensional data. Unlike in the linear setting, where Rayleigh quotient optimization coincides with classification, the two problems are very different under nonlinear settings. In this paper, we clarify this difference and show that Rayleigh quotient optimization may be of independent scientific interest. One major challenge of Rayleigh quotient optimization is that the variance of quadratic statistics involves all fourth cross-moments of the predictors, which are infeasible to compute in high-dimensional applications and may accumulate too many stochastic errors. This issue is resolved by considering a family of elliptical models. Moreover, for heavy-tailed distributions, robust estimates of mean vectors and covariance matrices are
Quadratic regression functionals are important for bandwidth selection in nonparametric regression and for nonparametric goodness-of-fit tests. Based on local polynomial regression, we propose estimators for weighted integrals of squared derivatives of regression functions. The rates of convergence in mean square error are calculated under various degrees of smoothness and appropriate choices of the smoothing parameter. Asymptotic distributions of the proposed quadratic estimators are derived under the Gaussian noise assumption. It is shown that when the estimators are pseudo-quadratic (linear components dominate quadratic components), asymptotic normality with rate n^{-1/2} can be achieved.
Generalized linear models and the quasi-likelihood method extend ordinary regression models to accommodate more general conditional distributions of the response. Nonparametric methods need no explicit parametric specification, and the resulting model is completely determined by the data themselves. However, nonparametric estimation schemes generally have a slower convergence rate, such as the local polynomial smoothing estimation of nonparametric generalized linear models studied in Fan, Heckman and Wand (1995). In this work, we propose two parametrically guided nonparametric estimation schemes by incorporating prior shape information on the link transformation of the response variable's conditional mean as a function of the predictor variable. Asymptotic results and numerical simulations demonstrate the improvement of our new estimation schemes over the original nonparametric counterpart.
In statistics and machine learning, people are often interested in the eigenvectors (or singular vectors) of certain matrices (e.g., covariance matrices and data matrices). However, those matrices are usually perturbed by noise or statistical errors, either from random sampling or structural patterns. One usually employs the Davis-Kahan sin theta theorem to bound the difference between the eigenvectors of a matrix A and those of a perturbed matrix Ã, in terms of the ℓ2 norm. In this paper, we prove that when A is a low-rank and incoherent matrix, the ℓ∞ norm perturbation bound of the singular vectors (or eigenvectors in the symmetric case) is smaller by a factor of sqrt(d1) or sqrt(d2) for left and right vectors, where d1 and d2 are the matrix dimensions. The power of this new perturbation result is shown in robust covariance estimation, particularly when random variables have heavy tails. There, we propose new robust covariance estimators and establish their asymptotic properties using the newly developed perturbation bound. Our theoretical results are verified through extensive numerical experiments.
Parametric option pricing models are widely used in finance. These models capture several features of asset price dynamics; however, their pricing performance can be significantly enhanced when they are combined with nonparametric learning approaches that learn and correct the pricing errors empirically. In this article we propose a new nonparametric method for pricing derivative assets. Our method relies on the state price distribution instead of the state price density, because the former is easier to estimate nonparametrically than the latter. A parametric model is used as an initial estimate of the state price distribution. Then the pricing errors induced by the parametric model are fitted nonparametrically. This model-guided method, called automatic correction of errors (ACE), estimates the state price distribution nonparametrically. The method is easy to implement and can be combined with any model-based
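The model-guided correction idea can be sketched in a stylized setting: start from a (deliberately misspecified) parametric price curve, fit its pricing errors nonparametrically, and add the fitted errors back. Everything below (the payoff shape, the kernel smoother, the bandwidth h) is an illustrative assumption, not the ACE implementation:

```python
import numpy as np

def kernel_smooth(x_train, r_train, x_eval, h):
    """Nadaraya-Watson smoother of pricing residuals."""
    out = np.empty(len(x_eval))
    for i, x in enumerate(x_eval):
        w = np.exp(-0.5 * ((x_train - x) / h) ** 2)
        out[i] = np.sum(w * r_train) / np.sum(w)
    return out

rng = np.random.default_rng(3)
strikes = np.linspace(80, 120, 200)
# "true" prices: intrinsic value plus a smooth hump the parametric model misses
true_price = np.maximum(100 - strikes, 0) + 5 * np.exp(-(((strikes - 100) / 10) ** 2))
observed = true_price + rng.normal(scale=0.1, size=strikes.size)

# step 1: a misspecified parametric initial estimate (flat premium)
parametric = np.maximum(100 - strikes, 0) + 3.0
# step 2: fit the parametric pricing errors nonparametrically, then correct
residual_fit = kernel_smooth(strikes, observed - parametric, strikes, h=2.0)
corrected = parametric + residual_fit

rmse_param = np.sqrt(np.mean((parametric - true_price) ** 2))
rmse_corr = np.sqrt(np.mean((corrected - true_price) ** 2))
```

The correction step only has to learn the smooth error surface, which is an easier nonparametric target than the full price function.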
Long-term prediction, such as multi-step time series prediction, is a challenging prognostics problem. This paper proposes an improved AR time series model, called the ND-AR (Nonlinear Degradation AutoRegression) model, for remaining useful life (RUL) estimation of lithium-ion batteries. The nonlinear degradation feature of lithium-ion battery capacity fade is analyzed, and the nonlinear accelerated degradation factor is extracted to improve the linear AR model. In this model, the nonlinear degradation factor is obtained by curve fitting, and the ND-AR model can then be applied as an adaptive data-driven prognostics method to monitor degradation time series data. Experimental results on the CALCE battery data set show that the proposed nonlinear degradation AR model achieves satisfactory prognostic performance for various lithium-ion batteries with low computational complexity.
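A stripped-down version of this trend-plus-AR idea might look as follows, with a synthetic capacity-fade curve, a quadratic trend fitted by curve fitting, and an AR(1) on the residuals. These choices are illustrative assumptions; the paper's nonlinear degradation factor and model orders differ:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(120)
# synthetic capacity-fade curve: accelerating (nonlinear) degradation plus noise
capacity = 1.0 - 0.0003 * t ** 1.5 + rng.normal(scale=0.002, size=t.size)

# step 1: extract the nonlinear degradation trend by curve fitting
coef = np.polyfit(t, capacity, deg=2)
trend = np.polyval(coef, t)

# step 2: fit a plain AR(1) to the detrended residuals by least squares
r = capacity - trend
phi = np.sum(r[1:] * r[:-1]) / np.sum(r[:-1] ** 2)

# step 3: multi-step forecast = extrapolated trend + decaying AR residual forecast
horizon = 30
t_fut = np.arange(120, 120 + horizon)
r_last, preds = r[-1], []
for step, tf in enumerate(t_fut, start=1):
    preds.append(np.polyval(coef, tf) + (phi ** step) * r_last)
preds = np.array(preds)
```

An RUL estimate would then be read off as the first forecast step at which the predicted capacity crosses a failure threshold.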
We propose a new nonparametric test for detecting the presence of jumps in asset prices using discretely observed data. Compared with the test in Aït-Sahalia and Jacod (2009), our new test enjoys the same asymptotic properties but has smaller variance. These results are justified both theoretically and numerically. We also propose a new procedure to locate the jumps: the jump identification problem reduces to a multiple comparison problem, and we employ the false discovery rate approach to control the probability of type I error. Numerical studies further demonstrate the power of our new method.
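The "jump location as multiple comparisons" step can be sketched as follows, with planted jumps in white-noise increments, a robust scale estimate, and the Benjamini-Hochberg step-up rule standing in for the paper's test statistics (all of these are illustrative assumptions, not the paper's procedure):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 1000
increments = rng.normal(scale=0.01, size=n)   # diffusive increments
jump_times = [200, 600, 900]
increments[jump_times] += 0.08                # planted jumps

# standardize each increment by a robust volatility estimate (global, for brevity;
# a local spot-volatility estimate would be used in practice)
sigma_hat = 1.4826 * np.median(np.abs(increments - np.median(increments)))
pvals = 2 * norm.sf(np.abs(increments) / sigma_hat)

def benjamini_hochberg(p, q=0.05):
    """Indices rejected by the BH step-up procedure at level q."""
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return set(order[:k].tolist())

detected = benjamini_hochberg(pvals)          # candidate jump times
```

Treating each increment as one hypothesis makes the FDR machinery directly applicable to controlling false jump identifications.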
Multiple testing procedures play an important role in detecting the presence of spatial signals in large-scale imaging data. Typically, the spatial signals are sparse but clustered. This paper provides empirical evidence that, for a range of commonly used control levels, the conventional FDR procedure can lack the ability to detect statistical significance even when the p-values under the true null hypotheses are independent and uniformly distributed; more generally, ignoring the neighboring information of spatially structured data tends to diminish the detection effectiveness of the FDR procedure. This paper first introduces a scalar quantity to characterize the extent to which this lack of identification phenomenon (LIP) of the FDR procedure occurs. Second, we propose a new multiple comparison procedure, called FDR_L, that accommodates the spatial information of neighboring p-values via a local aggregation of p-values.
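The local-aggregation idea can be illustrated with a median filter over neighboring p-values on a grid. This is a toy sketch: FDR_L's actual aggregation and threshold calibration are more involved than a plain median filter.

```python
import numpy as np

def local_median_pvalues(p_img, radius=1):
    """Replace each p-value by the median over its (2r+1) x (2r+1) neighborhood."""
    n, m = p_img.shape
    out = np.empty_like(p_img)
    for i in range(n):
        for j in range(m):
            block = p_img[max(0, i - radius):i + radius + 1,
                          max(0, j - radius):j + radius + 1]
            out[i, j] = np.median(block)
    return out

rng = np.random.default_rng(6)
p_img = rng.random((30, 30))                      # nulls: Uniform(0, 1)
p_img[10:15, 10:15] = rng.random((5, 5)) * 1e-4   # a clustered signal region

agg = local_median_pvalues(p_img)
# inside the cluster, aggregated p-values stay tiny; isolated small null
# p-values are pulled up toward their neighbors' values
signal_max = agg[11:14, 11:14].max()
```

Aggregation rewards spatially clustered small p-values, which is exactly the structure the conventional FDR procedure ignores.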
Local quasi-likelihood estimation is useful for nonparametric modeling in a widely used exponential family of distributions, namely generalized linear models. Yet the technique cannot be directly applied to situations where the response variable is missing at random. Three local quasi-likelihood estimation techniques are introduced: the local quasi-likelihood estimator using only complete data; the locally weighted quasi-likelihood method; and the local quasi-likelihood estimator with imputed values. These estimators share essentially the same first-order asymptotic biases and variances. Our simulation results show that substantial efficiency gains can be obtained by using the local quasi-likelihood estimator with imputed values. We also develop local quasi-likelihood imputation methods for estimating the mean functional of the response variable. It is shown that the proposed mean imputation estimators are asymptotically normal with an asymptotic variance that can be easily estimated. Data from an ongoing environmental epidemiologic study are used to illustrate the proposed methods.
Consider a linear model Y = Xβ + z, where X = X_{n,p} and z ~ N(0, I_n). The vector β is unknown, and it is of interest to separate its nonzero coordinates from the zero ones (i.e., variable selection). Motivated by examples in long-memory time series (Fan and Yao, 2003) and the change-point problem (Bhattacharya, 1994), we are primarily interested in the case where the Gram matrix G = X'X is non-sparse but sparsifiable by a finite-order linear filter. We focus on the regime where signals are both rare and weak, so that successful variable selection is very challenging but still possible.
Time- and state-domain methods are two common approaches to nonparametric prediction. Whereas the former uses data predominantly from recent history, the latter relies mainly on historical information. Combining these two pieces of valuable information is an interesting challenge in statistics. We surmount this problem by dynamically integrating information from both the time and state domains. The estimators from these two domains are optimally combined based on a data-driven weighting strategy, which provides a more efficient estimator of volatility. Asymptotic normality is separately established for the time domain, the state domain, and the integrated estimators. By comparing the efficiency of the estimators, we demonstrate that the proposed integrated estimator uniformly dominates the other two estimators. The proposed dynamic integration approach is also applicable to other estimation problems in time
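The weighting idea, combining two estimators with a data-driven inverse-variance weight, can be sketched in a stylized setting. The actual time- and state-domain estimators are kernel smoothers; here they are replaced by noisy proxies of a common volatility level, purely to illustrate the combination step:

```python
import numpy as np

rng = np.random.default_rng(7)
true_vol = 2.0
# two noisy estimator sequences of the same volatility, with different precisions
time_dom = true_vol + rng.normal(scale=0.5, size=200)   # time-domain estimates
state_dom = true_vol + rng.normal(scale=0.2, size=200)  # state-domain estimates

# data-driven weight: inverse-variance weighting estimated from the sample,
# w is the weight placed on the (noisier) time-domain estimator
v1, v2 = np.var(time_dom), np.var(state_dom)
w = v2 / (v1 + v2)
integrated = w * time_dom + (1 - w) * state_dom

mse_time = np.mean((time_dom - true_vol) ** 2)
mse_state = np.mean((state_dom - true_vol) ** 2)
mse_int = np.mean((integrated - true_vol) ** 2)
```

With independent errors, the inverse-variance combination has variance v1*v2/(v1+v2), which is below the variance of either input, mirroring the dominance result stated above.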
The paper studies estimation of partially linear hazard regression models with varying coefficients for multivariate survival data. A profile pseudo-partial likelihood estimation method is proposed. The estimation of the parameters of the linear part is accomplished via maximization of the profile pseudo-partial likelihood, whereas the varying-coefficient functions are considered as nuisance parameters that are profiled out of the likelihood. It is shown that the estimators of the parameters are root-n consistent and the estimators of the nonparametric coefficient functions achieve optimal convergence rates. Asymptotic normality is obtained for the estimators of the finite parameters and varying-coefficient functions. Consistent estimators of the asymptotic variances are derived and empirically tested, which facilitate inference for the model. We prove that the varying-coefficient functions can be estimated as well as if the
Large-scale multiple testing with correlated test statistics arises frequently in many areas of scientific research. Incorporating correlation information in approximating the false discovery proportion (FDP) has attracted increasing attention in recent years. When the covariance matrix of the test statistics is known, Fan and his colleagues provided an accurate approximation of the FDP under arbitrary dependence structure and some sparsity assumption. However, the covariance matrix is often unknown in many applications, and such dependence information must be estimated before approximating the FDP. The estimation accuracy can greatly affect the FDP approximation. In the current paper, we study theoretically the effect of unknown dependence on the testing procedure and establish a general framework such that the FDP can be well approximated. The effects of unknown dependence on approximating the FDP are in the
We have collected and cleaned two network data sets: coauthorship and citation networks for statisticians. The data sets are based on all research papers published in four of the top journals in statistics from 2003 to the first half of 2012. We analyze the data sets from many different perspectives, focusing on (a) productivity, patterns, and trends, (b) centrality, and (c) community structures.
Functional linear regression analysis aims to model regression relations which include a functional predictor. The analog of the regression parameter vector or matrix in conventional multivariate or multiple-response linear regression models is a regression parameter function in one or two arguments. If, in addition, one has scalar predictors, as is often the case in applications to longitudinal studies, the question arises of how to incorporate these into a functional regression model. We study a varying-coefficient approach where the scalar covariates are modeled as additional arguments of the regression parameter function. This extension of the functional linear regression model is analogous to the extension of conventional linear regression models to varying-coefficient models and shares its advantages, such as increased flexibility; however, the details of this extension are more challenging in the functional case. Our
In the analysis of clustered data, the regression coefficients are frequently assumed to be the same across all clusters. This hampers the ability to study the varying impacts of factors on each cluster. In this paper, a semiparametric model is introduced to account for varying impacts of factors over clusters by using cluster-level covariates. It achieves parsimony of parametrization and allows the exploration of nonlinear interactions. The random effect in the semiparametric model also accounts for within-cluster correlation. A local linear based estimation procedure is proposed for estimating the functional coefficients, the residual variance, and the within-cluster correlation matrix. The asymptotic properties of the proposed estimators are established, and methods for constructing simultaneous confidence bands are proposed and studied. In addition, relevant hypothesis testing problems are addressed. Simulation studies are