We investigate the asymptotic distributions of coordinates of regression M-estimates in the moderate p/n regime, where the number of covariates p grows proportionally with the sample size n. Under appropriate regularity conditions, we establish the coordinate-wise asymptotic normality of regression M-estimates for fixed design matrices. Our proof is based on the second-order Poincaré inequality (Chatterjee, 2009) and the leave-one-out analysis of El Karoui et al. (2011). Relevant examples show that our regularity conditions are satisfied by a broad class of design matrices. We also give a counterexample, the ANOVA-type design, to emphasize that the technical assumptions are not mere artifacts of the proof. Finally, numerical experiments confirm and complement our theoretical results.
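As a concrete instance of the setting above, the sketch below computes a Huber-loss regression M-estimate by iteratively reweighted least squares on a fixed Gaussian design with p/n = 0.1. The algorithm, tuning constant, and design are illustrative choices for the moderate-dimensional regime, not the paper's construction or proof device.

```python
import numpy as np

rng = np.random.default_rng(0)

def huber_irls(X, y, c=1.345, iters=100, tol=1e-10):
    """Huber-loss regression M-estimate via iteratively reweighted
    least squares (a standard algorithm; parameters are illustrative)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS starting point
    for _ in range(iters):
        r = y - X @ beta
        # Huber weights psi(r)/r: 1 inside [-c, c], c/|r| outside
        w = np.where(np.abs(r) <= c, 1.0, c / np.maximum(np.abs(r), 1e-12))
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Moderate-dimensional demo: p/n = 0.1, heavy-tailed (t_3) errors.
n, p = 500, 50
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + rng.standard_t(df=3, size=n)
beta_hat = huber_irls(X, y)
rel_err = np.linalg.norm(beta_hat - beta_true) / np.linalg.norm(beta_true)
```

The IRLS weights follow from writing the Huber estimating equation as a weighted least-squares problem; each iteration solves the resulting normal equations.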
We consider the fundamental problem of solving quadratic systems of m equations in n variables. We propose a novel method which, starting with an initial guess computed by means of a spectral method, proceeds by minimizing a nonconvex functional as in the Wirtinger flow approach. There are several key distinguishing features, most notably a distinct objective functional and novel update rules, which operate in an adaptive fashion and drop terms bearing too much influence on the search direction. These careful selection rules provide a tighter initial guess, better descent directions, and thus enhanced practical performance. On the theoretical side, we prove that for certain unstructured models of quadratic systems, our algorithms return the correct solution in linear time, i.e. in time proportional to reading the data, as soon as the ratio m/n between the number of equations and unknowns exceeds a fixed numerical constant. We extend the theory to deal with noisy systems and prove that our algorithms achieve a statistical accuracy that is nearly un-improvable. We complement our theoretical study with numerical examples showing that solving random quadratic systems is both computationally and statistically not much harder than solving linear systems of the same size---hence the title of this paper. For instance, we demonstrate empirically that the computational cost of our algorithm is about four times that of solving a least-squares problem of the same size.
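A minimal real-valued sketch of the two-stage recipe above (spectral initialization followed by adaptive, truncated gradient updates) is given below. The truncation rule, step size, and stopping criteria are simplified stand-ins for the paper's careful selection rules, chosen only to illustrate the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def solve_quadratic_system(A, y, iters=3000, step=0.05, trunc=5.0):
    """Spectral initialization + truncated gradient descent on
    f(z) = (1/4m) * sum(((a_i^T z)^2 - y_i)^2). Illustrative sketch."""
    m, n = A.shape
    # Spectral init: leading eigenvector of (1/m) sum y_i a_i a_i^T,
    # scaled so ||z0||^2 matches mean(y) (approx. ||x||^2).
    Y = (A.T * y) @ A / m
    _, V = np.linalg.eigh(Y)
    z = V[:, -1] * np.sqrt(np.mean(y))
    lam = np.mean(y)   # proxy for ||x||^2, sets the step-size scale
    for _ in range(iters):
        r = A @ z
        resid = (r**2 - y) * r
        # Adaptive truncation: drop equations whose term would dominate
        # the search direction (a crude version of the selection rules).
        keep = np.abs(r) <= trunc * np.sqrt(np.mean(r**2))
        grad = (A[keep].T @ resid[keep]) / m
        z = z - (step / lam) * grad
    return z

# Demo on a random real Gaussian model y_i = (a_i^T x)^2 with m/n = 8.
n, m = 10, 80
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x) ** 2
z = solve_quadratic_system(A, y)
err = min(np.linalg.norm(z - x), np.linalg.norm(z + x)) / np.linalg.norm(x)
```

The sign ambiguity is inherent to the quadratic measurements, so the error is computed up to a global sign.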
Community detection is a fundamental problem in social network analysis. It is desirable to have a fast community detection algorithm that is applicable to the large-scale networks we see today and also gives reliable community detection results. This paper proposes Spectral Clustering On Ratios-of-Eigenvectors (SCORE) as such an algorithm. The algorithm is fast in computation, is applicable to real large networks, and is competitive in real data performance when applied to some representative data sets in this area. We study the algorithm with the Degree-Corrected Block Model (DCBM). Compared to most works in this area, which focus on the much narrower Block Models, our analysis is more difficult and requires new techniques. The novelty of our algorithm lies in that it uses entry-wise ratios between leading eigenvectors of the adjacency matrix to reduce a seemingly hard problem to a problem that is very much tractable.
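The ratio idea can be sketched in a few lines: because leading eigenvectors of a DCBM adjacency matrix scale entry-wise with the degree parameters, dividing by the first eigenvector cancels the degree heterogeneity. The threshold choice and the tiny k-means routine below are illustrative, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(1)

def score(A, k, threshold=None):
    """SCORE sketch: cluster nodes via entry-wise ratios of the k leading
    eigenvectors of the adjacency matrix (ratios cancel degree effects)."""
    w, V = np.linalg.eigh(A)
    idx = np.argsort(np.abs(w))[::-1][:k]
    V = V[:, idx]                       # k leading eigenvectors
    R = V[:, 1:] / V[:, 0][:, None]    # entry-wise ratios
    if threshold is None:
        threshold = np.log(A.shape[0])  # guard against tiny denominators
    R = np.clip(R, -threshold, threshold)
    return kmeans(R, k)

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm for the demo."""
    r = np.random.default_rng(seed)
    C = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[lab == j].mean(0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return lab

# Demo: degree-corrected block model with two communities.
n = 200
c = np.repeat([0, 1], n // 2)                     # true labels
theta = rng.uniform(0.3, 1.0, n)                  # degree parameters
B = np.array([[0.5, 0.05], [0.05, 0.5]])
P = np.clip(np.outer(theta, theta) * B[c][:, c], 0, 1)
A = (rng.uniform(size=(n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                       # symmetric, no self-loops
lab = score(A, 2)
acc = max(np.mean(lab == c), np.mean(lab == 1 - c))
```

Accuracy is evaluated up to the usual label permutation.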
Ting-Li Chen (Institute of Statistical Science, Academia Sinica); Dai-Ni Hsieh (Institute of Statistical Science, Academia Sinica); Hung Hung (Institute of Epidemiology and Preventive Medicine); I-Ping Tu (Institute of Statistical Science, Academia Sinica); Pei-Shien Wu (Dept. of Biostatistics, Duke University); Yi-Ming Wu (Institute of Chemistry, Academia Sinica); Wei-Hau Chang (Institute of Chemistry, Academia Sinica); Su-Yun Huang (Institute of Statistical Science, Academia Sinica)
Statistics Theory and Methods; Data Analysis, Bio-Statistics, Bio-Mathematics; mathscidoc:2004.33002
The Annals of Applied Statistics, 8(1), 259–285, 2014
Cryo-electron microscopy (cryo-EM) has recently emerged as a powerful tool for obtaining three-dimensional (3D) structures of biological macromolecules in native states. A minimum cryo-EM image data set for deriving a meaningful reconstruction is comprised of thousands of randomly oriented projections of identical particles photographed with a small number of electrons. The computation of 3D structure from 2D projections requires clustering, which aims to enhance the signal-to-noise ratio in each view by grouping similarly oriented images. Nevertheless, the prevailing clustering techniques are often compromised by three characteristics of cryo-EM data: high noise content, high dimensionality and large number of clusters. Moreover, since clustering requires registering images of similar orientation into the same pixel coordinates by 2D alignment, it is desired that the clustering algorithm can label misaligned images as outliers. Herein, we introduce a clustering algorithm γ-SUP to model the data with a q-Gaussian mixture and adopt the minimum γ-divergence for estimation, and then use a self-updating procedure to obtain the numerical solution. We apply γ-SUP to the cryo-EM images of two benchmark macromolecules, RNA polymerase II and ribosome. In the former case, simulated images were chosen to decouple clustering from alignment to demonstrate γ-SUP is more robust to misalignment outliers than the existing clustering methods used in the cryo-EM community. In the latter case, the clustering of real cryo-EM data by our γ-SUP method eliminates noise in many views to reveal true structure features of ribosome at the projection
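The self-updating idea behind γ-SUP can be illustrated with a toy sketch: every point repeatedly moves to a weighted mean of all current points, with a compactly supported q-Gaussian-type weight that down-weights (and eventually ignores) far-away points, which is what confers robustness to outliers. The kernel form, parameter names `s` and `nu`, and merging tolerance below are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

def gamma_sup_sketch(X, s=1.5, nu=1.0, iters=100, tol=1e-6):
    """Toy self-updating procedure: iterate weighted means with a
    truncated (compact-support) q-Gaussian-type kernel; clusters emerge
    as points merge. Illustrative sketch only."""
    Z = X.copy()
    for _ in range(iters):
        D2 = ((Z[:, None] - Z[None]) ** 2).sum(-1)
        # Truncated kernel: zero weight beyond distance s, so outliers
        # and far clusters do not attract each other.
        W = np.clip(1.0 - D2 / s**2, 0.0, None) ** nu
        Znew = (W @ Z) / W.sum(1, keepdims=True)
        if np.abs(Znew - Z).max() < tol:
            Z = Znew
            break
        Z = Znew
    # Points that merged (within tolerance) share a cluster label.
    labels = -np.ones(len(Z), dtype=int)
    k = 0
    for i in range(len(Z)):
        if labels[i] < 0:
            close = np.linalg.norm(Z - Z[i], axis=1) < 1e-3
            labels[close] = k
            k += 1
    return labels, Z

# Demo: two well-separated 2-d blobs collapse to two cluster centers.
X = np.vstack([rng.normal(0.0, 0.2, (20, 2)),
               rng.normal(5.0, 0.2, (20, 2))])
labels, Z = gamma_sup_sketch(X, s=1.5)
```

Note that the number of clusters is not specified in advance; it is determined by how many merge points the iteration produces.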
We treat all the bivariate lack-of-memory (BLM) distributions in a unified approach and develop some new general properties of the BLM distributions, including the joint moment generating function, product moments, and dependence structure. Necessary and sufficient conditions for the survival functions of BLM distributions to be totally positive of order two are given. Some previous results about specific BLM distributions are improved. In particular, we show that both the Marshall–Olkin survival copula and survival function are totally positive of all orders, regardless of parameters. In addition, we point out that Slepian's inequality also holds true for BLM distributions.
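For reference, the bivariate lack-of-memory property and the Marshall–Olkin survival function mentioned above are standardly stated as follows (standard definitions, not this paper's specific parametrization):

```latex
% Bivariate lack-of-memory (BLM) property of a joint survival function S:
S(x + t,\, y + t) = S(x, y)\, S(t, t), \qquad x, y, t \ge 0.

% Marshall--Olkin survival function, \lambda_1, \lambda_2, \lambda_{12} \ge 0:
\bar F(x, y) = \exp\!\bigl(-\lambda_1 x - \lambda_2 y
                           - \lambda_{12} \max(x, y)\bigr).
```

The Marshall–Olkin family is the classical example satisfying the BLM functional equation, which is why it serves as the benchmark case for the total-positivity results.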
This paper is about the propagation of singularities in solutions to the Cauchy problem for the spatially inhomogeneous Boltzmann equation under the angular cutoff assumption. It is motivated by the work of Boudin and Desvillettes on the propagation of singularities in solutions near vacuum. It shows that for the solution near a global Maxwellian, singularities in the initial data propagate as in free transport. Precisely, the solution is the sum of two parts, one of which keeps the singularities of the initial data, while the other is regular with locally bounded derivatives of fractional order in some Sobolev space. In addition, the dependence of the regularity on the cross-section is also given.
In this paper, we provide the O(ε) corrections to the hydrodynamic model derived by Degond and Motsch from a kinetic version of the model by Vicsek and co-authors describing flocking biological agents. The parameter ε stands for the ratio of the microscopic to the macroscopic scales. The O(ε)-corrected model involves diffusion terms in both the mass and velocity equations, as well as terms which are quadratic functions of the first-order derivatives of the density and velocity. The derivation method is based on the standard Chapman–Enskog theory, but is significantly more complex than usual due to both the non-isotropy of the fluid and the lack of momentum conservation.
The approach combines second- and fourth-order statistics to perform blind source separation (BSS) of instantaneous mixtures. It applies for any number of receivers, provided there are at least as many receivers as sources. It is a batch algorithm that exploits the non-Gaussianity and stationarity of the source signals. It is a direct method based on linear algebra, reliable and robust, though large numbers of sources may slow down the computation significantly. It is, however, limited to instantaneous mixtures.
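One concrete algorithm of this second-plus-fourth-order family is FOBI, sketched below as an assumed illustration of the description above (the text does not name the method): whiten with the covariance (second order), then diagonalize a fourth-order weighted covariance. It requires sources with distinct kurtoses.

```python
import numpy as np

rng = np.random.default_rng(3)

def fobi(X):
    """FOBI-style separation of an instantaneous mixture (rows = receivers).
    Second-order step: whitening. Fourth-order step: eigendecomposition of
    C4 = E[||x||^2 x x^T]. Sources recovered up to order/sign/scale."""
    Xc = X - X.mean(1, keepdims=True)
    w, V = np.linalg.eigh(np.cov(Xc))
    Xw = (V / np.sqrt(w)) @ V.T @ Xc            # whitened data
    C4 = (Xw * (Xw**2).sum(0)) @ Xw.T / X.shape[1]
    _, U = np.linalg.eigh(C4)                   # rotation from 4th-order stats
    return U.T @ Xw

# Demo: two stationary sources with distinct kurtoses (Laplace vs uniform).
T = 5000
S = np.vstack([rng.laplace(size=T), rng.uniform(-1, 1, T)])
A = np.array([[1.0, 0.5], [0.3, 1.0]])
Y = fobi(A @ S)
corr = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])
```

Each recovered signal should correlate strongly with exactly one true source.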
We present a novel variation of the well-known infomax algorithm of blind source separation. Under natural gradient descent, the infomax algorithm converges to a stationary point of a limiting ordinary differential equation. However, due to the presence of saddle points or local minima of the corresponding likelihood function, the algorithm may be trapped around these bad stationary points for a long time, especially if the initial data are near them. To speed up convergence, we propose to add a sequence of random perturbations to the infomax algorithm to shake the iterating sequence so that it is captured by a path descending to a more stable stationary point. We analyze the convergence of the randomly perturbed algorithm, and illustrate its fast convergence through numerical examples on blind demixing of stochastic signals. The examples have analytical structures so that saddle points or local minima of the likelihood functions are explicit. The results may have implications for online learning algorithms in similar problems.
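The perturbation idea can be sketched as follows: run the natural-gradient infomax update and add a decaying random shake after each step. The tanh score, step size, and 1/t annealing schedule below are illustrative choices for super-Gaussian sources, not the paper's analyzed perturbation sequence.

```python
import numpy as np

rng = np.random.default_rng(4)

def whiten(X):
    """Center and whiten rows-as-signals data."""
    Xc = X - X.mean(1, keepdims=True)
    w, V = np.linalg.eigh(np.cov(Xc))
    return (V / np.sqrt(w)) @ V.T @ Xc

def infomax_perturbed(X, mu=0.01, sigma=0.1, sweeps=10):
    """Natural-gradient infomax with a decaying random perturbation added
    to each update to help escape saddle points (illustrative sketch)."""
    n = X.shape[0]
    W = np.eye(n)
    t = 0
    for _ in range(sweeps):
        for x in X.T:
            t += 1
            y = W @ x
            phi = np.tanh(y)   # score function for super-Gaussian sources
            W = W + mu * (np.eye(n) - np.outer(phi, y)) @ W  # natural gradient
            W = W + (sigma / t) * rng.standard_normal((n, n))  # random shake
    return W

# Demo: blind demixing of two Laplacian (super-Gaussian) sources.
T = 2000
S = rng.laplace(size=(2, T))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
Xw = whiten(A @ S)
W = infomax_perturbed(Xw)
Y = W @ Xw
corr = np.abs(np.corrcoef(np.vstack([Y, S]))[:2, 2:])
```

The 1/t decay makes the perturbation vanish asymptotically, so the late iterations behave like the unperturbed algorithm near the stable stationary point.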
We consider the diffusivity of random walks with transition probabilities depending on the number of consecutive traversals of the last traversed edge, the so-called senile reinforced random walk (SeRW). In one dimension, the walk is known to be sub-diffusive with identity reinforcement function. We perturb the model by introducing a small probability δ of escaping the last traversed edge at each step. The perturbed SeRW model is diffusive for any δ > 0, with enhanced diffusivity in the small-δ regime. We further study stochastically perturbed SeRW models in which the last-edge escape probability is a power of δ with exponents given by independent random variables. Enhanced diffusivity in such models is logarithmically close to the so-called residual diffusivity (positive in the zero-δ limit), with diffusivity bounded between two powers of δ. Finally, we generalize our results to higher dimensions, where the unperturbed model is already diffusive, and quantify the enhanced diffusivity in terms of δ.
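A direct simulation makes the enhancement visible: with identity reinforcement f(k) = k, the walker re-traverses its last edge with probability k/(k + 1), producing long back-and-forth runs and sub-diffusive behavior; an escape probability δ truncates those runs. The run/escape mechanics below are a simplified reading of the model for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def serw_msd(delta, T=2000, walkers=400):
    """MSD/T of a 1-d senile reinforced random walk with identity
    reinforcement f(k) = k and escape probability delta (sketch)."""
    x_final = np.empty(walkers)
    for w in range(walkers):
        x = 0
        s = 1 if rng.uniform() < 0.5 else -1   # direction of last step
        k = 1                                  # consecutive traversals
        x += s
        for _ in range(T - 1):
            if rng.uniform() < delta:
                # escape the last traversed edge: fresh symmetric step
                s = 1 if rng.uniform() < 0.5 else -1
                k = 1
            elif rng.uniform() < k / (k + 1.0):
                s, k = -s, k + 1               # re-traverse the last edge
            else:
                k = 1                          # move on across a new edge
            x += s
        x_final[w] = x
    return np.mean(x_final**2) / T

d0 = serw_msd(0.0)    # near-critical, sub-diffusive regime
d1 = serw_msd(0.5)    # perturbed, diffusive regime
```

At finite time the unperturbed MSD/T is already suppressed (of order 1/log T for the identity reinforcement), while the perturbed walk is genuinely diffusive.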
We study a system of semilinear hyperbolic equations passively advected by smooth, white-in-time random velocity fields. Such a system arises in modelling non-premixed isothermal turbulent flames under single-step kinetics of fuel and oxidizer. We derive closed equations for one-point and multi-point probability distribution functions (PDFs), and closed-form analytical formulae for the one-point PDF, as well as the two-point PDF under homogeneity and isotropy. Exact solution formulae allow us to analyse the ensemble-averaged fuel/oxidizer concentrations and the motion of their level curves. We recover the empirical formulae of combustion in the thin reaction zone limit and show that these approximate formulae can either underestimate or overestimate average concentrations when the reaction zone is not tending to zero. We show that the averaged reaction rate slows down locally in
We study the enhanced diffusivity in the so-called elephant random walk model with stops (ERWS) by including symmetric random walk steps at a small probability ε. At any ε > 0, the large-time behavior transitions from sub-diffusive at ε = 0 to diffusive in a wedge-shaped parameter regime where the diffusivity is strictly above that of the unperturbed ERWS model in the ε ↓ 0 limit. The perturbed ERWS model is shown to be solvable, with the first two moments and their asymptotics calculated exactly in both one and two space dimensions. The model provides a discrete analytical setting for the residual diffusion phenomenon known for passive scalar transport in chaotic flows (e.g., generated by time-periodic cellular flows, and statistically sub-diffusive) as molecular diffusivity tends to zero.
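The transition can be observed numerically. Below, the unperturbed ERWS recalls a uniformly chosen past step and repeats it with probability p, reverses it with probability q, and otherwise stops; the perturbation replaces the step with a fresh symmetric ±1 step with probability ε. The parameter values are illustrative choices in the stop-dominated (sub-diffusive-looking) regime.

```python
import numpy as np

rng = np.random.default_rng(6)

def erws_msd(eps, p=0.15, q=0.05, T=2000, walkers=400):
    """MSD/T for the elephant random walk with stops (ERWS), perturbed by
    a symmetric +-1 step with probability eps (simulation sketch)."""
    x_final = np.empty(walkers)
    for w in range(walkers):
        steps = np.empty(T, dtype=int)
        steps[0] = 1 if rng.uniform() < 0.5 else -1
        for t in range(1, T):
            if rng.uniform() < eps:
                # perturbation: fresh symmetric random-walk step
                steps[t] = 1 if rng.uniform() < 0.5 else -1
            else:
                past = steps[rng.integers(0, t)]  # recall a uniform past step
                u = rng.uniform()
                # repeat w.p. p, reverse w.p. q, otherwise stop (step 0)
                steps[t] = past if u < p else (-past if u < p + q else 0)
        x_final[w] = steps.sum()
    return np.mean(x_final**2) / T

d0 = erws_msd(0.0)
d1 = erws_msd(0.3)
```

Because stops are themselves remembered and reinforced, the unperturbed walk freezes over time, while any ε > 0 restores a strictly positive diffusivity.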
This paper introduces an efficient approach to integrating non-local statistics into the higher-order Markov Random Fields (MRFs) framework. Motivated by the observation that many non-local statistics (e.g., shape priors, color distributions) can usually be represented by a small number of parameters, we reformulate the higher-order MRF model by introducing additional latent variables to represent the intrinsic dimensions of the higher-order cliques. The resulting new model, called NC-MRF, not only provides the flexibility in representing the configurations of higher-order cliques, but also automatically decomposes the energy function into less coupled terms, allowing us to design an efficient algorithmic framework for maximum a posteriori (MAP) inference. Based on this novel modeling/inference framework, we achieve state-of-the-art solutions to the challenging problems of class-specific image segmentation and template-based 3D facial expression tracking, which demonstrate the potential of our approach.
Evolution occurs in populations of reproducing individuals. The structure of a population can affect which traits evolve [1, 2]. Understanding evolutionary game dynamics in structured populations remains difficult. Mathematical results are known for special structures in which all individuals have the same number of neighbours [3, 4, 5, 6, 7, 8]. The general case, in which the number of neighbours can vary, has remained open. For arbitrary selection intensity, the problem is in a computational complexity class that suggests there is no efficient algorithm [9]. Whether a simple solution for weak selection exists has remained unanswered. Here we provide a solution for weak selection that applies to any graph or network. Our method relies on calculating the coalescence times [10, 11] of random walks [12]. We evaluate large numbers of diverse population structures for their propensity to favour cooperation. We study how small
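The coalescence-time computation can be sketched concretely. Below, pairwise coalescence times of two random walks (one walker moves per step) are obtained from a linear system, and the critical benefit-to-cost ratio for death-birth updating under weak selection is formed as t2 / (t3 - t1); this formula and recursion follow my reading of the method described, so treat the sketch as an assumed reconstruction rather than the paper's verbatim algorithm.

```python
import numpy as np

def coalescence_times(W):
    """Coalescence times tau_ij of two random walks on a weighted graph W:
    tau_ii = 0 and, for i != j,
      tau_ij = 1 + (1/2) * sum_k (p_ik tau_kj + p_jk tau_ik),
    solved as one linear system over all ordered pairs."""
    n = W.shape[0]
    P = W / W.sum(1, keepdims=True)
    A = np.eye(n * n)
    b = np.zeros(n * n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue       # tau_ii = 0: identity row with b = 0
            r = i * n + j
            b[r] = 1.0
            for k in range(n):
                A[r, k * n + j] -= 0.5 * P[i, k]
                A[r, i * n + k] -= 0.5 * P[j, k]
    return np.linalg.solve(A, b).reshape(n, n)

def critical_bc(W):
    """(b/c)* = t2 / (t3 - t1), where t_m averages tau over m-step
    random-walk displacements from a stationary starting vertex."""
    P = W / W.sum(1, keepdims=True)
    pi = W.sum(1) / W.sum()
    tau = coalescence_times(W)
    t = [float(pi @ (np.linalg.matrix_power(P, m) * tau).sum(1))
         for m in (1, 2, 3)]
    return t[1] / (t[2] - t[0])

# Sanity check: on the complete graph K_n, tau_ij = n - 1 for i != j.
n = 6
Kn = np.ones((n, n)) - np.eye(n)
tau_K = coalescence_times(Kn)
# Cycle of length 20 (2-regular): closed-form tau_d = d * (N - d).
C = np.zeros((20, 20))
for i in range(20):
    C[i, (i + 1) % 20] = C[i, (i - 1) % 20] = 1
bc_cycle = critical_bc(C)
```

On the cycle this evaluates to 2.25, consistent with the classical degree-k heuristic b/c > k approaching 2 for large cycles.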
When people in a society want to make inference about some parameter, each person would potentially want to use data collected by other people. Information (data) exchange in social contexts is usually costly, so to make sound statistical decisions, people need to compromise between the benefits and costs of information acquisition. Conflicts of interest and coordination problems will arise. Classical statistics does not consider people's interaction in the data collection process. To address this gap, this work explores multi-agent Bayesian inference problems with a game-theoretic social network model. Bearing our interest in aggregate inference at the societal level, we propose a new concept, "finite population learning," to address whether, with high probability, a large fraction of people can make good inferences. Serving as a foundation, this concept enables us to study the long-run trend of aggregate inference quality as the population grows.
Economists historically measure the degree to which the market is surprised by an earnings announcement by the consensus forecast error, defined as the difference between the actual earnings and the consensus forecast. The consensus might be calculated using either the mean or median of security analysts' forecasts. The premise of this measure is that the consensus forecast is a good proxy for the market's expectation of earnings. Hence the consensus forecast error captures how surprised the market is when earnings are announced. The consensus forecast error is a building block of a host of studies across finance, accounting and economics (see Kothari (2001) for a survey of event studies). For instance, in finance and accounting, it is used in event studies of how efficiently markets react to earnings announcements. Efficient market studies when it comes to bond or currency markets and macroeconomic
Using high-frequency data, we estimate the risk of a large portfolio with weights being the solution of an optimization problem subject to some linear inequality constraints. We propose a fully nonparametric approach as a benchmark, as well as a factor-based semiparametric approach with observable factors to attack the curse of dimensionality. We provide in-fill asymptotic distributions of the realized volatility estimators of the optimal portfolio, while taking into account the estimation error in the optimal portfolio weights as a result of the covariance matrix estimation. Our theoretical findings suggest that ignoring such an error leads to a first-order asymptotic bias which undermines the statistical inference. Such a bias is related to in-sample optimism in portfolio allocation. Our simulation results suggest satisfactory finite sample performance after bias correction, and that the factor-based approach becomes increasingly superior with a growing cross-sectional dimension. Empirically, using a large cross-section of high-frequency stock returns, we find our estimator successfully addresses the issue of in-sample optimism.
The multiple testing procedure plays an important role in detecting the presence of spatial signals for large-scale imaging data. Typically, the spatial signals are sparse but clustered. This paper provides empirical evidence that for a range of commonly used control levels, the conventional FDR procedure can lack the ability to detect statistical significance, even if the p-values under the true null hypotheses are independent and uniformly distributed; more generally, ignoring the neighboring information of spatially structured data will tend to diminish the detection effectiveness of the FDR procedure. This paper first introduces a scalar quantity to characterize the extent to which the lack of identification phenomenon (LIP) of the FDR procedure occurs. Second, we propose a new multiple comparison procedure, called FDR_L, to accommodate the spatial information of neighboring p-values, via a local aggregation of p-values. Theoretical properties of the FDR_L procedure are investigated under weak dependence of p-values. It is shown that the FDR_L procedure alleviates the LIP of the FDR procedure, thus substantially facilitating the selection of more stringent control levels. Simulation evaluations indicate that the FDR_L procedure improves the detection sensitivity of the FDR procedure with little loss in detection specificity. The computational simplicity and detection effectiveness of the FDR_L procedure are illustrated through a real brain fMRI dataset.
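The local-aggregation idea can be sketched on a 1-d grid: replace each p-value by a summary (here, the median, an illustrative choice) of the p-values in its spatial neighborhood, then run the usual Benjamini–Hochberg step-up procedure on the aggregated values.

```python
import numpy as np

def bh(p, alpha):
    """Benjamini-Hochberg step-up procedure; boolean rejection mask."""
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rej = np.zeros(m, dtype=bool)
    rej[order[:k]] = True
    return rej

def fdr_local(p, alpha, window=1):
    """Local-aggregation sketch: median of each p-value's neighborhood
    on a 1-d grid, followed by BH (illustrative, not the exact FDR_L)."""
    m = len(p)
    agg = np.empty(m)
    for i in range(m):
        lo, hi = max(0, i - window), min(m, i + window + 1)
        agg[i] = np.median(p[lo:hi])
    return bh(agg, alpha)

# Deterministic demo: a cluster of small p-values at positions 10..19.
p = np.full(100, 0.5)
p[10:20] = 0.001
rej_bh = bh(p, alpha=0.05)
rej_l = fdr_local(p, alpha=0.05, window=1)
```

Aggregation leaves a clustered block of small p-values intact while pushing isolated small nulls toward their (large) neighbors, which is the mechanism behind the improved sensitivity for clustered signals.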
The problem of estimating the density function and the regression function involving errors-in-variables in time series is considered. Under appropriate conditions, it is shown that the rates obtained in Fan (1991) and Fan and Truong (1990) are also achievable in the context of dependent observations. Consequently, the results presented here extend our previous results for cross-sectional data to longitudinal data. Abbreviated title: Measurement Errors in Time Series. AMS 1980 subject classifications: primary 62G20; secondary 62G05, 62J99.
Measuring conditional dependence is an important topic in statistics with broad applications including graphical models. Under a factor model setting, a new conditional dependence measure based on projection is proposed. The corresponding conditional independence test is developed, with its asymptotic null distribution derived in a regime where the number of factors can be high-dimensional. It is also shown that the new test controls the asymptotic significance level and can be computed efficiently. A generic method for building dependency graphs without the Gaussian assumption using the new test is elaborated. Numerical results and real data analysis show the superiority of the new method.
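To convey the projection idea, the toy sketch below removes the span of the observed factors from both variables and applies a Fisher-z test to the residual correlation. This is a simplified stand-in for the paper's measure and test statistic, meant only to illustrate "conditional dependence via projection."

```python
import numpy as np

rng = np.random.default_rng(7)

def cond_dep_test(x, y, F):
    """Project out observed factors F (plus an intercept) from x and y,
    then apply a Fisher-z test to the residual correlation (sketch)."""
    n = len(x)
    Q, _ = np.linalg.qr(np.column_stack([np.ones(n), F]))
    rx = x - Q @ (Q.T @ x)            # residual of x after projection
    ry = y - Q @ (Q.T @ y)            # residual of y after projection
    r = (rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry))
    z = np.sqrt(n - F.shape[1] - 3) * np.arctanh(r)   # Fisher z statistic
    return r, z

# Demo: (x, y0) are independent given the factor f; (x1, y1) are not.
n = 2000
f = rng.standard_normal(n)
F = f[:, None]
e = rng.standard_normal(n)
x = f + rng.standard_normal(n)
y0 = f + rng.standard_normal(n)                    # conditionally independent
x1 = f + e
y1 = f + 0.8 * e + 0.6 * rng.standard_normal(n)    # shares e with x1
r0, z0 = cond_dep_test(x, y0, F)
r1, z1 = cond_dep_test(x1, y1, F)
```

Under conditional independence the statistic is approximately standard normal, while shared residual noise produces a large |z|.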
In this note we include a correction for Equation (19) on page 840, which is a step in the proof of Theorem 4 of Fan et al. (2014). There is no change in the statement of Theorem 4, and the rest of the proof stays unchanged. Equation (19) on page 840 should be corrected as follows: we apply the coordinate-wise mean-value theorem with respect to each coordinate (i.e., each j) to obtain that
Motivated by the sampling problems and heterogeneity issues common in high-dimensional big datasets, we consider a class of discordant additive index models. We propose method-of-moments-based procedures for estimating the indices of such discordant additive index models in both low- and high-dimensional settings. Our estimators are based on factorizing certain moment tensors and are also applicable in the overcomplete setting, where the number of indices exceeds the dimensionality of the datasets. Furthermore, we provide rates of convergence of our estimators in both the high- and low-dimensional settings. Establishing such results requires deriving tensor operator norm concentration inequalities that might be of independent interest. Finally, we provide simulation results supporting our theory. Our contributions extend the applicability of tensor methods to novel models, in addition to making progress on understanding the theoretical properties of such tensor methods.
Several large volatility matrix estimation procedures have been recently developed for factor-based Itô processes whose integrated volatility matrix consists of low-rank and sparse components. Their performance depends on the accuracy of the input volatility matrix estimators. When estimating co-volatilities based on high-frequency data, one of the crucial challenges is non-synchronization for illiquid assets, which makes their co-volatility estimators inaccurate. In this paper, we study how to estimate the large integrated volatility matrix without using co-volatilities of illiquid assets. Specifically, we treat the co-volatilities of illiquid assets as missing, and estimate the low-rank matrix using a matrix completion scheme with a structured missing pattern. To further regularize the sparse volatility matrix, we employ the principal orthogonal complement thresholding method (POET). We also investigate the asymptotic
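The POET regularization step referred to above can be sketched as follows: split a covariance estimate into a rank-k principal part plus a soft-thresholded residual, thresholding off-diagonals only. The threshold value and the one-spike sanity check are illustrative; the matrix-completion step for the structured missing pattern is not shown.

```python
import numpy as np

def poet(S, k, tau):
    """POET sketch: rank-k principal part of S plus a soft-thresholded
    residual covariance (off-diagonal entries only)."""
    w, V = np.linalg.eigh(S)
    idx = np.argsort(w)[::-1][:k]
    L = (V[:, idx] * w[idx]) @ V[:, idx].T       # low-rank principal part
    R = S - L                                    # residual covariance
    Rt = np.sign(R) * np.maximum(np.abs(R) - tau, 0.0)
    np.fill_diagonal(Rt, np.diag(R))             # keep residual variances
    return L + Rt, L, Rt

# Sanity check on a population covariance: one spike plus identity.
p = 40
b = np.ones(p)
Sigma = np.outer(b, b) + np.eye(p)               # spiked covariance
S_hat, L, Rt = poet(Sigma, k=1, tau=0.15)
off = Rt - np.diag(np.diag(Rt))                  # thresholded off-diagonals
```

Here the residual of the spiked matrix has small off-diagonals, so thresholding zeroes them exactly while the diagonal is preserved, and the reconstruction stays entry-wise close to the truth.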