Are Discoveries Spurious? Distributions of Maximum Spurious Correlations and Their Applications

Article (PDF Available) in The Annals of Statistics 46(3) · February 2015 with 125 Reads

DOI: 10.1214/17-AOS1575 · Source: arXiv

Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries by such data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions on exogeneity of covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given certain number of predictors, namely, the distribution of the correlation of a response variable Y with the best s linear combinations of p covariates X, even when X and Y are independent. When the covariance matrix of X possesses the restricted eigenvalue property, we derive such distributions for both finite s and diverging s, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of X. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where residuals are from regularized fits. Our approach is then applied to construct the upper confidence limit for the maximum spurious correlation and testing exogeneity of covariates. The former provides a baseline for guarding against false discoveries due to data mining and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated by both numerical examples and real data analysis.

Methods used:

Regularization

Histogram of the sample correlations between the residuals and the covariates √ (blue) and histogram of N (0 , 1 / n ) random variables (green).

…

Distributions of maximum spurious correlations for p = 1000 and s = 1, 2, 5 and 10 when Σ is the identity matrix (left panel) or block diagonal (right panel) with the first block being a 500 × 500 equi-correlation matrix with a correlation 0.8 and the second block being the 500 × 500 identity matrix. From top to bottom: n = 50, 100 and 200.

…

Distributions of maximum spurious correlations (blue) and multiplier bootstrap approximations (for a given data set; red) based on 1600 simulations with combinations of p = 2000, 5000, s = 1, 2, 5, 10, and n = 400 when Σ is an identity matrix.

…

Content uploaded by Wen-Xin Zhou

Content may be subject to copyright.

Download full-text PDF

Submitted to the Annals of Statistics

ARE DISCOVERIES SPURIOUS?

DISTRIBUTIONS OF MAXIMUM SPURIOUS

CORRELATIONS AND THEIR APPLICATIONS∗

By Jianqing Fank,∗∗, Qi-Man Shao†† and Wen-Xin Zhouk,‡‡

Princeton Universityk, Academy of Mathematics and Systems Science∗∗,

Chinese University of Hong Kong††

and University of California, San Diego‡‡

Over the last two decades, many exciting variable selection meth-

ods have been developed for ﬁnding a small group of covariates that

are associated with the response from a large pool. Can the discov-

eries from these data mining approaches be spurious due to high di-

mensionality and limited sample size? Can our fundamental assump-

tions about the exogeneity of the covariates needed for such variable

selection be validated with the data? To answer these questions, we

need to derive the distributions of the maximum spurious correlations

given a certain number of predictors, namely, the distribution of the

correlation of a response variable Ywith the best slinear combina-

tions of pcovariates X, even when Xand Yare independent. When

the covariance matrix of Xpossesses the restricted eigenvalue prop-

erty, we derive such distributions for both a ﬁnite sand a diverging

s, using Gaussian approximation and empirical process techniques.

However, such a distribution depends on the unknown covariance

matrix of X. Hence, we use the multiplier bootstrap procedure to ap-

proximate the unknown distributions and establish the consistency of

such a simple bootstrap approach. The results are further extended

to the situation where the residuals are from regularized ﬁts. Our

approach is then used to construct the upper conﬁdence limit for

the maximum spurious correlation and to test the exogeneity of the

covariates. The former provides a baseline for guarding against false

discoveries and the latter tests whether our fundamental assumptions

for high-dimensional model selection are statistically valid. Our tech-

niques and results are illustrated with both numerical examples and

real data analysis.

1. Introduction. Information technology has forever changed the data

collection process. Massive amounts of very high-dimensional or unstruc-

∗J. Fan was partially supported by NSF Grants DMS-1206464, DMS-1406266, NIH

Grant R01-GM072611-10, and National Center for Mathematics and Interdisciplinary Sci-

ences, Chinese Academy of Sciences, Beijing 100190. Q.-M. Shao was supported in part by

Hong Kong Research Grants Council GRF-403513 and GRF-14302515. W.-X. Zhou was

supported by NIH Grant R01-GM072611-10.

Keywords and phrases: High dimension, spurious correlation, bootstrap, false discovery.

arXiv:1502.04237v3 [math.ST] 11 Apr 2017

2J. FAN, Q.-M. SHAO AND W.-X. ZHOU

tured data are continuously produced and stored at an aﬀordable cost. Mas-

sive and complex data and high dimensionality characterize contemporary

statistical problems in many emerging ﬁelds of science and engineering. Var-

ious statistical and machine learning methods and algorithms have been pro-

posed to ﬁnd a small group of covariate variables that are associated with

given responses such as biological and clinical outcomes. These methods

have been very successfully applied to genomics, genetics, neuroscience, eco-

nomics, and ﬁnance. For an overview of high-dimensional statistical theory

and methods, see the review article by Fan and Lv (2010) and monographs

by Dudoit and van der Laan (2007), Hastie, Tibshirani and Friedman (2009),

Efron (2010) and B¨uhlmann and van de Geer (2011).

Underlying machine learning, data mining, and high-dimensional statis-

tical techniques, there are many model assumptions and even heuristic ar-

guments. For example, the LASSO [Tibshirani (1996)] and the SCAD [Fan

and Li (2001)] are based on an exogeneity assumption, meaning that all of

the covariates and the residual of the true model are uncorrelated. However,

it is nearly impossible that such a random variable, which is the part of

the response variable that can not be explained by a small group of covari-

ates and lives in a low-dimensional space spanned by the response and the

small group of variables, is uncorrelated with any of the tens of thousands

of coviariates. Indeed, Fan and Liao (2014) and Fan, Han and Liu (2014)

provide evidence that such an ideal assumption might not be valid, although

it is a necessary condition for model selection consistency. Even under the

exogenous assumption, conditions such as the restricted eigenvalue condi-

tion [Bickel, Ritov and Tsybakov (2009)] and homogeneity [Fan, Han and

Liu (2014)] are needed to ensure model selection consistency or oracle prop-

erties. Despite their critical importance, these conditions have rarely been

veriﬁed in practice. Their violations can lead to false scientiﬁc discoveries.

A simpler question is then, for a given data set, do data mining techniques

produce results that are better than spurious correlation? The answer de-

pends on not only the correlation between the ﬁtted and observed values,

but also on the sample size, the number of variables selected, and the total

number of variables.

To better appreciate the above two questions, let us consider an example.

We take the gene expression data on 90 Asians (45 Japanese and 45 Han

Chinese) from the international ‘HapMap’ project [Thorisson et al. (2005)].

The normalized gene expression data are generated with an Illumina Sentrix

Human-6 Expression Bead Chip [Stranger et al. (2007)] and are available

on ftp://ftp.sanger.ac.uk/pub/genevar/. We take the expressions of

gene CHRNA6, a cholinergic receptor, nicotinic, alpha 6, as the response

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 3

Fig 1: Histogram of the sample correlations between the residuals and each

covariate (blue) and histogram of N(0,1/√n) random variables (green).

Yand the remaining expressions of probes as covariates Xwith dimension

p= 47292. We ﬁrst ﬁt an `1-penalized least-squares regression (LASSO)

on the data with a tuning parameter automatically selected via ten-fold

cross validation (25 genes are selected). The correlation between the LASSO-

ﬁtted value and the response is 0.8991. Next, we reﬁt an ordinary least-

squares regression on the selected model to calculate the ﬁtted response and

residual vector. The sample correlation between the post-LASSO ﬁt and

observed responses is 0.9214, a remarkable ﬁt! But is it any better than the

spurious correlation? The model diagnostic plot, which depicts the empirical

distribution of the correlations between each covariate Xjand the residual

bεafter the LASSO ﬁt, is given in Figure 1. Does the exogenous assumption

that E(εXj) = 0 for all j= 1, . . . , p hold?

To answer the above two important questions, we need to derive the distri-

butions of the maximum spurious correlations. Let Xbe the p-dimensional

random vector of the covariates and XSbe a subset of covariates indexed by

S. Let dcorrn(ε, αT

SXS) be the sample correlation between the random noise

ε(independent of X) and αT

SXSbased on a sample of size n, where αSis

4J. FAN, Q.-M. SHAO AND W.-X. ZHOU

a constant vector. Then, the maximum spurious correlation is deﬁned as

(1.1) b

Rn(s, p) = max

|S|=smax

αSdcorrn(ε, αT

SXS),

when Xand εare independent, where the maximization is taken over all

p

ssubsets of size sand all of the linear combinations of the selected s

covariates. Next, let (Yi,Xi),...,(Yn,Xn) be independent and identically

distributed (i.i.d.) observations from the linear model Y=XTβ∗+ε. Assume

that scovariates are selected by a certain variable selection method for

some 1 ≤smin(p, n). If the correlation between the ﬁtted response and

observed response is no more than the 90th or the 95th percentile of b

Rn(s, p),

it is hard to claim that the ﬁtted value is impressive or even genuine. In this

case, the ﬁnding is hardly more impressive than the best ﬁt using data

that consist of independent response and explanatory variables, 90% or 95%

of the time. To simplify and unify the terminology, we call this result the

spurious discovery throughout this paper.

For the aforementioned gene expression data, as 25 probes are selected,

the observed correlation of 0.9214 between the ﬁtted value and the response

should be compared with the distribution of b

Rn(25, p). Further, a simple

method to test the null hypothesis

(1.2) E(εXj)=0,for all j= 1, . . . , p,

is to compare the maximum absolute correlation in Figure 1with the dis-

tribution of b

Rn(1, p). See additional details in Section 5.3.

The importance of such spurious correlation was recognized by Cai and

Jiang (2011), Fan, Guo and Hao (2012) and Cai, Fan and Jiang (2013).

When the data are independently and normally distributed, they derive the

distribution of b

Rn(1, p), which is equivalent to the distribution of the mini-

mum angle to the north pole among prandom points uniformly distributed

on the (n+ 1)-dimensional sphere. Fan, Guo and Hao (2012) conducted

simulations to demonstrate that the spurious correlation can be very high

when pis large and grows quickly with s. To demonstrate this eﬀect and to

examine the impact of correlation and sample size, we conduct a similar but

more extensive simulation study based on a combination of the stepwise ad-

dition and branch-and-bound algorithms. We simulate Xfrom N(0,Ip) and

N(0,Σ0), where Σ0is block diagonal with the ﬁrst block being a 500 ×500

equi-correlation matrix with a correlation 0.8 and the second block being

the (p−500) ×(p−500) identity matrix. Yis simulated independently of

Xand follows the standard normal distribution. Figure 2depicts the simu-

lation results for n= 50,100 and 200. Clearly, the distributions depend on

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 5

Fig 2: Distributions of maximum spurious correlations for p= 1000 and

s= 1,2,5 and 10 when Σis the identity matrix (left panel) or block diagonal

(right panel) with the ﬁrst block being a 500 ×500 equi-correlation matrix

with a correlation 0.8 and the second block being the 500 ×500 identity

matrix. From top to bottom: n= 50,100 and 200.

(s, p, n) and Σ, the covariance matrix of X, although the dependence on Σ

does not seem very strong. However, the theoretical result of Fan, Guo and

Hao (2012) covers only the very speciﬁc case where s= 1 and Σ=Ip.

There are several challenges to deriving the asymptotic distribution of the

statistic b

Rn(s, p), as it involves combinatorial optimization. Further techni-

cal complications are added by the dependence among the covariates X.

Nevertheless, under the restricted eigenvalue condition [Bickel, Ritov and

Tsybakov (2009)] on Σ, in this paper, we derive the asymptotic distribution

of such a spurious correlation statistic for both a ﬁxed sand a diverging s,

using the empirical process and Gaussian approximation techniques given in

Chernozhukov, Chetverikov and Kato (2014a). As expected, such distribu-

tions depend on the unknown covariance matrix Σ. To provide a consistent

estimate of the distributions of the spurious correlations, we consider the

use of a multiplier bootstrap method and demonstrate its consistency under

mild conditions. The multiplier bootstrap procedure has been widely used

due to its good numerical performance. Its theoretical validity is guaranteed

6J. FAN, Q.-M. SHAO AND W.-X. ZHOU

by the multiplier central limit theorem [van der Vaart and Wellner (1996)].

For the most advanced recent results, we refer to Chatterjee and Bose (2005),

Arlot, Blanchard and Roquain (2010) and Chernozhukov, Chetverikov and

Kato (2013). In particular, Chernozhukov, Chetverikov and Kato (2013) de-

veloped a number of non-asymptotic results on a multiplier bootstrap for

the maxima of empirical mean vectors in high dimensions with applications

to multiple hypothesis testing and parameter choice for the Dantzig selec-

tor. The use of multiplier bootstrapping enables us to empirically compute

the upper conﬁdence limit of b

Rn(s, p) and hence decide whether discover-

ies by statistical machine learning techniques are any better than spurious

correlations.

The rest of this paper is organized as follows. Section 2discusses the

concept of spurious correlation and introduces the main conditions and no-

tation. Section 3presents the main results of the asymptotic distributions of

spurious correlations and their bootstrap approximations, which are further

extended in Section 4. Section 5identiﬁes three important applications of

our results to high-dimensional statistical inference. Section 6presents the

numerical studies. The proof of Theorem 3.1 is provided in Section 7, and

the proofs for the remaining theoretical results are provided in the supple-

mentary material [Fan, Shao and Zhou (2017)].

2. Spurious correlation, conditions, and notation. Let ε, ε1, . . . , εn

be i.i.d. random variables with a mean of zero and a ﬁnite variance σ2>0,

and let X,X1,...,Xnbe i.i.d. p-dimensional random vectors with a mean

of zero and a covariance matrix Σ=E(XXT)=(σjk )1≤j,k≤p. Write

X= (X1, . . . , Xp)T,Xi= (Xi1, . . . , Xip)T, i = 1, . . . , n.

Assume that the two samples {εi}n

i=1 and {Xi}n

i=1 are independent. Then,

the spurious correlation (1.1) can be written as

(2.1) b

Rn(s, p) = max

α∈Sp−1:|α|0=sdcorrnε, αTX,

where the dimension pand sparsity sare allowed to grow with the sample

size n. Here dcorrn(·,·) denotes the sample Pearson correlation coeﬃcient

and Sp−1:= {α∈Rp:|α|2= 1}is the unit sphere of Rp. Due to the anti-

symmetric property of the sample correlation under the sign transformation

of α, we have also

(2.2) b

Rn(s, p) = max

α∈Sp−1:|α|0=s|dcorrnε, αTX|,

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 7

More speciﬁcally, we can express b

Rn(s, p) as

max

S⊆[p]:|S|=smax

α∈Ss−1Pn

i=1(εi−¯εn)α,Xi,S −¯

Xn,S

qPn

i=1(εi−¯εn)2·Pn

i=1 α,Xi,S −¯

Xn,S2.(2.3)

By the scale-invariance property of b

Rn(s, p), we assume without loss of gen-

erality that σ2= 1 and Σis a correlation matrix, so that diag(Σ) = Ip.

For a random variable X, the sub-Gaussian norm kXkψ2and sub-exponential

norm kXkψ1of Xare deﬁned, respectively, as

kXkψ2= sup

q≥1

q−1/2E|X|q1/q and kXkψ1= sup

q≥1

q−1E|X|q1/q.

A random variable Xthat satisﬁes kXkψ2<∞(resp., kXkψ1<∞) is called

a sub-Gaussian (resp., sub-exponential) random variable [Vershynin (2012)].

The following moment conditions for ε∈Rand X∈Rpare imposed.

Condition 2.1.There exists a random vector Usuch that X=Σ1/2U,

E(U) = 0,E(UUT) = Ipand K1:= supα∈Sp−1kαTUkψ2<∞. The random

variable εhas a zero mean and unit variance, and is sub-Gaussian with

K0:= kεkψ2<∞. Moreover, write vq=E(|ε|q)for q≥3.

The following is our assumption for the sampling process.

Condition 2.2.{εi}n

i=1 and {Xi}n

i=1 are independent random samples

from the distributions of εand X, respectively.

For 1 ≤s≤p, the s-sparse minimal and maximal eigenvalues [Bickel,

Ritov and Tsybakov (2009)] of the covariance matrix Σare deﬁned as

φmin(s) = min

u∈Rp:1≤|u|0≤s|u|Σ/|u|22, φmax(s) = max

u∈Rp:1≤|u|0≤s|u|Σ/|u|22,

where |u|Σ= (uTΣu)1/2and |u|2= (uTu)1/2is the `2-norm of u. Con-

sequently, for 1 ≤s≤p, the s-sparse condition number of Σis given by

(2.4) γs=γs(Σ) = pφmax(s)/φmin (s).

The quantity γsplays an important role in our analysis.

The following notation is used. For the two sequences {an}and {bn}

of positive numbers, we write an=O(bn) or an.bnif there exists a

constant C > 0 such that an/bn≤Cfor all suﬃciently large n; we write

8J. FAN, Q.-M. SHAO AND W.-X. ZHOU

anbnif there exist constants C1, C2>0 such that, for all nlarge enough,

C1≤an/bn≤C2; and we write an∼bnand an=o(bn) if limn→∞ an/bn= 1

and limn→∞ an/bn= 0, respectively. For a, b ∈R, we write a∨b= max(a, b)

and a∧b= min(a, b). For every vector u, we denote by |u|q=Pi≥1|ui|q1/q

for q > 0 and |u|0=Pi≥1I{ui6= 0}. We use hu,vi=uTvto denote the

inner product of two vectors uand vwith the same dimension and kMkto

denote the spectral norm of a matrix M. For every positive integer `, we write

[`] = {1,2, . . . , `}, and for any set S, we use Scto denote its complement

and |S|for its cardinality. For each p-dimensional vector uand p×ppositive

semi-deﬁnite matrix A, we write |u|A= (uTAu)1/2. In particular, put

(2.5) αΣ=α/|α|Σ

for every α∈Rpand set 0Σ=0as the convention.

3. Distributions of maximum spurious correlations. In this sec-

tion, we ﬁrst derive the asymptotic distributions of the maximum spurious

correlation b

Rn(s, p). The analytic form of such asymptotic distributions can

be obtained in the isotropic case. As the asymptotic distributions of b

Rn(s, p)

depend on the unknown covariance matrix Σ, we provide a bootstrap esti-

mate and demonstrate its consistency.

3.1. Asymptotic distributions of maximum spurious correlations. In view

of (2.3), we can rewrite b

Rn(s, p) as

Rn(s, p) = sup

f∈F

n−1Pn

i=1(εi−¯εn)f(Xi−¯

Xn)

pn−1Pn

i=1(εi−¯εn)2·qn−1Pn

i=1 f2(Xi−¯

Xn)

,(3.1)

where ¯εn=n−1Pn

i=1 εi,¯

Xn=n−1Pn

i=1 Xiand

(3.2) F=F(s, p) = x7→ fα(x) := hα,xi:α∈ V

is a class of linear functions Rp7→ R, where V=V(s, p) = {α∈Sp−1:

|α|0=s}. The dependence of Fand Von (s, p) is suppressed.

Let Z= (Z1, . . . , Zp)Tbe a p-dimensional Gaussian random vector with

a mean of zero and the covariance matrix Σ, i.e., Zd

=N(0,Σ). Denote by

(1) ≤Z2

(2) ≤ ··· ≤ Z2

(p)the order statistics of {Z2

1, . . . , Z2

p}. The following

theorem shows that the distribution of the maximum absolute multiple cor-

relation b

Rn(s, p) can be approximated by that of the supremum of a centered

Gaussian process G∗indexed by F.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 9

Theorem 3.1.Let Conditions 2.1 and 2.2 hold, n, p ≥2and 1≤s≤p.

Then there exists a constant C > 0independent of (s, p, n)such that

sup

t≥0P√nb

Rn(s, p)≤t−PR∗(s, p)≤t

≤C(K0K1)3/4n−1/8{sbn(s, p)}7/8,(3.3)

where K0and K1are deﬁned in Condition 2.1,bn(s, p) := log(γsp/s)∨log n

for γsas in (2.4), R∗(s, p) := supf∈F G∗fand G∗={G∗f}f∈F is a centered

Gaussian process indexed by Fdeﬁned as, for every fα∈ F,

(3.4) G∗fα=αT

ΣZ=αTZ

√αTΣα.

In particular, if Σ=Ipand slog(pn) = o(n1/7), then as n→ ∞,

sup

t≥0Pnb

n(s, p)≤t−PZ2

(p)+· ·· +Z2

(p−s+1) ≤t→0.(3.5)

Remark 3.1.The Berry-Esseen bound given in Theorem 3.1 depends

explicitly on the triplet (s, p, n), and it depends on the covariance matrix Σ

only through its s-sparse condition number γs, deﬁned in (2.4). The proof

of (3.3) builds on a number of technical tools including a standard covering

argument, maximal and concentration inequalities for the suprema of un-

bounded empirical processes and Gaussian processes as well as a coupling in-

equality for the maxima of sums of random vectors derived in Chernozhukov,

Chetverikov and Kato (2014a). Instead, if we directly resort to the general

framework in Theorem 2.1 of Chernozhukov, Chetverikov and Kato (2014a),

the function class of interest is F=x7→ αTx

(αTΣα)1/2:α∈Sp−1,|α|0=s.

Checking high-level conditions in Theorem 2.1 can be rather complicated

and less intuitive. Also, dealing with the (uniform) entropy integral that

corresponds to the class Frelies on verifying various VC-type properties,

and thus can be fairly tedious. Following a strategy similar to that used to

prove Theorem 2.1, we provide a self-contained proof of Theorem 3.1 in Sec-

tion 7.2 by making the best use of the speciﬁc structure of F. The proof is

more intuitive and straightforward. More importantly, it leads to an explicit

non-asymptotic bound under transparent conditions.

Remark 3.2.In Theorem 3.1, the independence assumption of εand X

can be relaxed as E(εX) = 0, E(ε2|X) = σ2and E(ε4|X)≤Calmost surely,

where C > 0 is a constant.

10 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Expression (3.5) indicates that the increment n{b

n(s, p)−b

n(s−1, p)}

is approximately the same as Z2

(p−s+1). This can simply be seen from the

asymptotic joint distribution of b

Rn(1, p),b

Rn(2, p),..., b

Rn(s, p). The fol-

lowing proposition establishes the approximation of the joint distributions

when both the dimension pand sparsity sare allowed to diverge with the

sample size n.

Proposition 3.1.Let Conditions 2.1 and 2.2 hold with Σ=Ip. Assume

that the triplet (s, p, n)satisﬁes 1≤s < n ≤pand s2log p=o(n1/7). Then

as n→ ∞,

sup

0≡t0<t1<t2<···<ts<1Ps

k=1 b

Rn(k, p)≤tk

−Ps

k=1 Z2

(p−k+1) ≤n(t2

k−t2

k−1)→0.

Remark 3.3.When s= 1 and if (n, p) satisﬁes log p=o(n1/7), it is

straightforward to verify that, for any t∈R,

(3.6) PZ2

(p)−2 log p+ log(log p)≤t→exp(−π−1/2e−t/2) as p→ ∞.

This result is similar in nature to (5) in Fan, Guo and Hao (2012). In fact, it

is proved in Shao and Zhou (2014) that the extreme-value statistic b

Rn(1, p)

is sensitive to heavy-tailed data in the sense that, under the ultra-high di-

mensional scheme, even the law of large numbers for the maximum spurious

correlation requires exponentially light tails of the underlying distribution.

We refer readers to Theorem 2.1 in Shao and Zhou (2014) for details. There-

fore, we believe that the exponential-type moment assumptions required in

Theorem 3.1 cannot be weakened to polynomial-type ones as long as log pis

allowed to be as large as ncfor some c∈(0,1). However, it is worth mention-

ing that the factor 1/7 in Proposition 3.1 may not be optimal, and according

to the results in Shao and Zhou (2014), 1/3 is the best possible factor to

ensure that the asymptotic theory is valid. To close this gap in theory, a

signiﬁcant amount of additional work and new probabilistic techniques are

needed. We do not pursue this line of research in this paper.

For a general s≥2, we establish in the following proposition the limiting

distribution of the sum of the top sorder statistics of i.i.d. chi-square random

variables with degree of freedom 1.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 11

Proposition 3.2.Assume that s≥2is a ﬁxed integer. For any t∈R,

we have as p→ ∞,

PZ2

(p)+· ·· +Z2

(p−s+1) −sap≤t

−→ π(1−s)/2

(s−1)!Γ(s−1) Zt/s

−∞ Z(t−sv)/2

us−2e−udue−(s−1)v/2g(v)dv,(3.7)

where ap= 2 log p−log(log p),G(t) = exp(−π−1/2e−t/2)and g(t) = G0(t) =

e−t/2

2√πG(t). The above integral can further be expressed as

G(t/s) + π1−s/2e−t/2

(s−1)! Zt/s

−∞

eug(u)du +π(1−s)/2e−t/2

(s−1)!

s−2

j=1 G(t/s)e(j+1)t/(2s)πj/2

`=1

(s−`)−1

j!2jZt/s

−∞

(t−sv)jev/2g(v)dv.

(3.8)

In particular, when s= 2, the last term on the right-hand side of (3.8)

vanishes so that, as p→ ∞,

PZ2

(p)+Z2

(p−1) −2ap≤t→G(t/2) + e−t/2

2√πZt/2

−∞

eu/2G(u)du.

The proofs of Propositions 3.1 and 3.2 are placed in the supplemental

material [Fan, Shao and Zhou (2017)].

3.2. Multiplier bootstrap approximation. The distribution of R∗(s, p) =

supf∈F G∗ffor G∗in (3.4) depends on the unknown Σand thus cannot

be used for statistical inference. In the following, we consider the use of

a Monte Carlo method to simulate a process that mimics G∗, now known

as the multiplier (wild) bootstrap method, which is similar to that used in

Hansen (1996), Barrett and Donald (2003) and Chernozhukov, Chetverikov

and Kato (2013), among others.

Let b

Σnbe the sample covariance matrix based on the data {Xi}n

i=1 and

ξ1, . . . , ξnbe i.i.d. standard normal random variables that are independent

of {εi}n

i=1 and {Xi}n

i=1. Then, given {Xi}n

i=1,

(3.9) Zn=n−1/2

i=1

ξi(Xi−¯

Xn)∼N(0,b

Σn).

12 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

The following result shows that the (unknown) distribution of R∗(s, p) =

supfα∈F fα(Z)

√αTΣαfor Zd

=N(0,Σ) can be consistently estimated by the con-

ditional distribution of

(3.10) RMB

n(s, p) := sup

fα∈F

fα(Zn)

qαTb

Σnα

Theorem 3.2.Let Conditions 2.1 and 2.2 hold. Assume that the triplet

(s, p, n)satisﬁes 1≤s≤pand slog(γspn) = o(n1/5). Then as n→ ∞,

(3.11) sup

t≥0PR∗(s, p)≤t−PRMB

n(s, p)≤tX1,...,XnP

−→ 0.

Remark 3.4.Together, Theorems 3.1 and 3.2 show that the maximum

spurious correlation b

Rn(s, p) can be approximated in distribution by the

multiplier bootstrap statistic n−1/2RMB

n(s, p). In practice, when the sample

size nis relatively small, the value of n−1/2RMB

n(s, p) may exceed 1, which

makes it less favorable as a proxy for spurious correlation. To address this

issue, we propose using the following corrected bootstrap approximation:

(3.12) RCMB

n(s, p) := sup

fα∈F

√n

|ξ|2

fα(Zn)

qαTb

Σnα

where ξ= (ξ1, . . . , ξn)Tis used in the deﬁnition of Zn. By the Cauchy-

Schwarz inequality, n−1/2RCMB

n(s, p) is always between 0 and 1. In view of

(3.10) and (3.12), RCMB

n(s, p) diﬀers from RMB

n(s, p) only up to a multiplica-

tive random factor n−1/2|ξ|2, which in theory is concentrated around 1 with

exponentially high probability. Thus, RMB

nand RCMB

nare asymptotically

equivalent, and (3.11) remains valid with RMB

nreplaced by RCMB

4. Extension to sparse linear models. Suppose that the observed

response Yand p-dimensional covariate Xfollows the sparse linear model

(4.1) Y=XTβ∗+ε,

where the regression coeﬃcient β∗is sparse. The sparsity is typically ex-

plored by the LASSO [Tibshirani (1996)], the SCAD [Fan and Li (2001)], or

the MCP [Zhang (2010)]. Now it is well-known that, under suitable condi-

tions, the SCAD and the MCP, among other folded concave penalized least-

square estimators, also enjoy the unbiasedness property and the (strong)

oracle properties. For simplicity, we focus on the SCAD. For a given random

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 13

sample {(Xi, Yi)}n

i=1, the SCAD exploits the sparsity by pλ-regularization,

which minimizes

(4.2) (2n)−1

i=1

(Yi−XT

iβ)2+

j=1

pλ(|βj|;a)

over β= (β1, . . . , βp)T∈Rp, where pλ(·;a) denotes the SCAD penalty

function [Fan and Li (2001)], i.e., p0

λ(t;a) = λI(t≤λ) + (aλ−t)+

a−1I(t>λ) for

some a > 2, and λ=λn≥0 is a regularization parameter.

Denote by X= (X1,...,Xn)Tthe n×pdesign matrix, Y= (Y1, . . . , Yn)T

the n-dimensional response vector, and ε= (ε1, . . . , εn)T, the n-dimensional

noise vector. Without loss of generality, we assume that β∗= (βT

1,βT

2)T

with each component of β1∈Rsbeing non-zero and β2=0, such that

S0:= supp(β∗) = {1, . . . , s}is the true underlying sparse model of the

indices with s=|β∗|0. Moreover, write X= (X1,X2), where X1∈Rn×s

consists of the columns of Xindexed by S0. In this notation, Y=Xβ+ε=

X1β1+εand the oracle estimator b

βoracle has an explicit form of

βoracle

1= (XT

1X1)−1XT

1Y=β1+ (XT

1X1)−1XT

1ε,b

βoracle

2=0.(4.3)

In other words, the oracle estimator is the unpenalized estimator that min-

imizes Pn

i=1(Yi−XT

i,S0βS0)2over the true support set S0.

Denote by b

εoracle = (bεoracle

1,...,bεoracle

n)T=Y−XTb

βoracle the residuals af-

ter the oracle ﬁt. Then, we can construct the maximum spurious correlation

as in (2.2), except that {εi}n

i=1 is now replaced by {bεoracle

i}n

i=1, i.e.,

Roracle

n(1, p)

= max

j∈[p]|Pn

i=1(bεoracle

i−eT

εoracle)(Xij −¯

Xj)|

qPn

i=1(bεoracle

i−eT

εoracle)2·qPn

i=1(Xij −¯

Xj)2

,(4.4)

where en= (1/n, . . . , 1/n)T∈Rnand ¯

Xj=n−1Pn

i=1 Xij. We here deal

with the speciﬁc case of a spurious correlation of size 1, as this is what is

needed for testing the exogeneity assumption (1.2).

To establish the limiting distribution of b

Roracle

n(1, p), we make the follow-

ing assumptions.

Condition 4.1.Y=Xβ∗+εwith supp (β∗) = {1, . . . , s}and ε=

(ε1, . . . , εn)Tbeing i.i.d. centered sub-Gaussian satisfying that K0=kεikψ2<

∞. The rows of X= (X1,...,Xn)Tare i.i.d. sub-Gaussian random vectors

as in Condition 2.1.

14 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

As before, we can assume that Σ=E(XiXT

i) is a correlation matrix with

diag (Σ) = Ip. Set d=p−sand partition

Σ=Σ11 Σ12

Σ21 Σ22 with Σ11 ∈Rs×s,Σ22 ∈Rd×d,Σ21 =ΣT

12.(4.5)

Let Σ22.1= (eσjk)1≤j,k≤d=Σ22 −Σ21Σ−1

11 Σ12 be the Schur complement of

Σ11 in Σ.

Condition 4.2.eσmin = min1≤j≤deσjj is bounded away from zero.

Theorem 4.1.Assume that Conditions 4.1 and 4.2 hold, and that the

triplet (s, p, n)satisﬁes slog p=o(√n)and log p=o(n1/7). Then the maxi-

mum spurious correlation b

Roracle

n(1, p)in (4.4)satisﬁes that, as n→ ∞,

sup

t≥0P√nb

Roracle

n(1, p)≤t−P|e

Z|∞≤t→0,(4.6)

where e

=N(0,Σ22.1)is a d-variate centered Gaussian random vector with

covariance matrix Σ22.1.

As pλis a folded-concave penalty function, (4.2) is a non-convex opti-

mization problem. The local linear approximation (LLA) algorithm can be

applied to produce a certain local minimum for any ﬁxed initial solution

[Zou and Li (2008), Fan, Xue and Zou (2014)]. In particular, Fan, Xue and

Zou (2014) prove that the LLA algorithm can deliver the oracle estimator

in the folded concave penalized problem with overwhelming probability if it

is initialized by some appropriate initial estimator.

Let b

βLLA be the estimator computed via the one-step LLA algorithm

initiated by the LASSO estimator [Tibshirani (1996)]. That is,

(4.7) b

βLLA = arg min

β(2n)−1

i=1

(Yi−XT

iβ)2+

j=1

λ(|b

βLASSO

j|)|βj|,

where pλis a folded concave penalty, such as the SCAD and MCP penalties,

and b

βLASSO = arg minβ{(2n)−1Pn

i=1(Yi−XT

iβ)2+λ|β|1}. Accordingly,

denote by b

RLLA

n(1, p) the maximum spurious correlation as in (4.4) with

bεoracle

ireplaced by bεLLA

i=Yi−XT

βLLA. Applying Theorem 4.1, we derive

the limiting distribution of b

RLLA

n(1, p) under suitable conditions. First, let

us recall the Restricted Eigenvalue concept formulated by Bickel, Ritov and

Tsybakov (2009).

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 15

Definition 4.1.For any integer s0∈[p]and positive number c0, the

RE (s0, c0)parameter κ(s0, c0,A)of a p×pmatrix Ais deﬁned as

(4.8) κ(s0, c0,A) := min

S⊆[p]:|S|≤s0

min

δ6=0:|δSc|1≤c0|δS|1

δTAδ

|δS|2

Theorem 4.2.Assume that Conditions 4.1 and 4.2 hold, the minimal

signal strength of β∗satisﬁes minj∈S0|βj|>(a+ 1)λfor a, λ as in (4.2),

and that the triplet (s, p, n)satisﬁes slog p=o(√n),slog p

κ(s,3+,Σ)=o(n)for

some  > 0and log p=o(n1/7). If the regularization parameters (λ, λLASSO)

are such that λ≥8√s

κ(s,3,Σ)λLASSO and λLASSO ≥CK0p(log p)/n for C > 0

large enough, then as n→ ∞,

sup

t≥0P√nb

RLLA

n(1, p)≤t−P|e

Z|∞≤t→0,(4.9)

where e

=N(0,Σ22.1).

5. Applications to high-dimensional inferences. This section out-

lines three applications in high-dimensional statistics. The ﬁrst determines

whether discoveries by machine learning and data mining techniques are any

better than those reached by chance. Second, we show that the distributions

of maximum spurious correlations can also be applied to model selection. In

the third application, we validate the fundamental assumption of exogeneity

(1.2) in high dimensions.

5.1. Spurious discoveries. Let qCMB

α(s, p) be the upper α-quantile of the

random variable RCMB

n(s, p) deﬁned by (3.12). Then, an approximate 1 −α

upper conﬁdence limit of the spurious correlation is given by qCMB

α(s, p). In

view of Theorems 3.1 and 3.2, we claim that

(5.1) Pb

Rn(s, p)≤qCMB

α(s, p)→1−α.

To see this, recall that RCMB

n=√nRMB

n/|ξ|2for ξ= (ξ1, . . . , ξn)Tas in

(3.12), and given {Xi}n

i=1,RMB

nis the supremum of a Gaussian process. Let

FMB

n(t) = P{RMB

n(s, p)≤t|X1,...,Xn}be the (conditional) distribution

function of RMB

nand deﬁne t0= inf{t:FMB

n(t)>0}. By Theorem 11.1

of Davydov, Lifshits and Smorodina (1998), FMB

nis absolutely continuous

with respect to the Lebesgue measure and is strictly increasing on (t0,∞),

indicating that P{RCMB

n≤qCMB

α(s, p)|X1,...,Xn}=αalmost surely. This,

together with (3.3) and (3.11), proves (5.1) under Conditions 2.1,2.2, and

when slog(γspn) = o(n1/7).

16 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Let b

Yibe ﬁtted values using spredictors indexed by b

Sselected by a data-

driven technique and Yibe the associated response value. They are denoted

in the vector form by b

Yand Y, respectively. If

(5.2) dcorrn(Y,b

Y)≤qCMB

α(s, p),

then the discovery of variables b

Scan be regarded as spurious; that is, no bet-

ter than by chance. Therefore, the multiplier bootstrap quantile qCMB

α(s, p)

provides an important critical value and yardstick for judging whether the

discovery is spurious, or whether the selected set b

Sincludes too many spu-

rious variables. This yardstick is independent of the method used in the

ﬁtting.

Remark 5.1.The problem of judging whether the discovery is spuri-

ous is intrinsically diﬀerent from that of testing the global null hypothesis

H0:β∗=0, which itself is an important problem in high-dimensional statis-

tical inference and has been well-studied in the literature since the seminal

work of Goeman, van de Geer and van Houwelingen (2006). For example, the

global null hypothesis H0:β∗=0can be rejected by a test; still, the corre-

lation between Yand the variables b

Sselected by a statistical method can be

smaller than the maximum spurious correlation, and we should interpret the

ﬁndings of b

Swith caution. We need either more samples or more powerful

variable selection methods. This motivates us to derive the distribution of

the maximum spurious correlation b

Rn(s, p). This distribution serves as an

important benchmark for judging whether the discovery (of sfeatures from

pexplanatory variables based on a sample of size n) is spurious. The magni-

tude of b

Rn(s, p) gives statisticians an idea of how big a spurious correlation

can be, and therefore an idea of how much the covariates really contribute

to the regression for a given sample size.

5.2. Model selection. In the previous section, we consider the reference

distribution of the maximum spurious correlation statistic b

Rn(s, p) as a

benchmark for judging whether the discovery of ssigniﬁcant variables (among

all of the pvariables using a random sample of size n) is impressive, regard-

less of which variable selection tool is applied. In this section, we show how

the distribution of b

Rn(s, p) can be used to select a model. Intuitively, we

would like to select a model that ﬁts better than the spurious ﬁt. This limits

the candidate sets of models and provides an upper bound on the model

size. In our experience, this upper bound itself provides a model selector.

We now use LASSO as an illustration of the above idea. Owing to spurious

correlation, almost all of the variable selection procedures will, with high

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 17

probability, select a number of spurious variables in the model so that the

selected model is over-ﬁtted. For example, the LASSO method with the

regularization parameter selected by cross-validation typically selects a far

larger model size, as the bias caused by the `1penalty forces the cross-

validation procedure to choose a smaller value of λ. Thus, it is important

to stop the LASSO path earlier and the quantiles of b

Rn(s, p) provide useful

guards.

Speciﬁcally, consider the LASSO estimator b

βλfor the sparse linear model

(4.1) with bsλ=|supp(b

βλ)|, where λ > 0 is the regularization parameter.

We consider the LASSO solution path with the largest knot λini := |XTY|∞

and the smallest knot λcv selected by ten-fold cross-validation. To avoid

over-ﬁtting, we propose using qCMB

αas a guide to choose the regularization

parameter that guards us from selecting too many spurious variables. For

each λin the path, we compute dcorrn(b

Yλ,Y), the sample correlation between

the post-LASSO ﬁtted and observed responses, and qCMB

α(bsλ, p). Let b

λαbe

the largest λsuch that the sign of dcorrn(b

Yλ,Y)−qCMB

α(bsλ, p) is nonnegative

and then ﬂips in the subsequent knot. The selected model is given by b

Sα=

supp(b

βb

λα). As demonstrated by the simulation studies in Section 6.4, this

procedure selects a much smaller model size that is closer to the real data.

5.3. Validating exogeneity. Fan and Liao (2014) show that the exogenous

condition (1.2) is necessary for penalized least-squares to achieve a model

selection consistency. They question the validity of such an exogeneous as-

sumption, as it imposes too many equations. They argue further that even

when the exogenous model holds for important variables XS, i.e.,

(5.3) Y=XT

Sβ∗

S+ε, E(εXS) = 0,

the extra variables XN(with N=Sc) are collected in an eﬀort to cover the

unknown set S— but no veriﬁcation of the conditions

(5.4) E(εXN) = E{(Y−XT

Sβ∗

S)XN}=0

has ever been made. The equality E{(Y−XT

Sβ∗

S)Xj}= 0 in (5.4) holds

by luck for some covariate Xj, but it can not be expected that this holds

for all j∈N. They propose a focussed generalized method of moment

(FGMM) to avoid the unreasonable assumption (5.4). Recognizing (5.3) is

not identiﬁable in high-dimensional linear models, they impose additional

conditions such as E(εX2

S) = 0.

Despite its fundamental importance to high-dimensional statistics, there

are no available tools for validating (1.2). Regarding (1.2) as a null hypoth-

esis, an asymptotically α-level test can be used to reject assumption (1.2)

18 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

when

(5.5) b

Tn,p = max

j∈[p]√ndcorrn(Xj, ε)≥qCMB

α(1, p).

By Theorems 3.1 and 3.2, the test statistic has an approximate size α. The

p-value of the test can be computed via the distribution of the Gaussian

multiplier process RCMB

n(1, p).

As pointed out in the introduction, when the components of Xare weakly

correlated, the distribution of the maximum spurious correlation does not

depend very sensitively on Σ. See also Lemma 6 in Cai, Liu and Xia (2014).

In this case, we can approximate it by the identity matrix, and hence one

can compare the renormalized test statistic

(5.6) Jn,p =b

n,p −2 log p+ log(log p)

with the limiting distribution in (3.6). The critical value for test statistic

Jn,p is

(5.7) Jα=−2 log{−√πlog(1 −α)},

and the associated p-value is given by

(5.8) exp(−π−1/2e−Jn,p/2).

Expressions (5.7) and (5.8) provide analytic forms for a quick validation of

the exogenous assumption (1.2) under weak dependence. In general, we rec-

ommend using the wild bootstrap, which takes into account the correlation

eﬀect and provides more accurate estimates especially when the dependence

is strong. See Chang et al. (2017) for more empirical evidences.

In practice, εis typically unknown to us. Therefore, b

Tn,p in (5.5) is calcu-

lated using the ﬁtted residuals {bεLLA

i}n

i=1. In view of Theorem 4.2, we need

to adjust the null distribution according to (4.9). By Theorem 3.2, we adjust

the deﬁnition of the process Znin (3.9) by

(5.9) ZLLA

n=n−1/2

i=1

ξi(XLLA

i−XLLA

n)∈Rp−|

S|,

where XLLA

i=Xi,

N−b

Σb

Σ−1

SXi,

Sis the residuals of Xb

Nregressed on

S, where b

Sis the set of selected variables, b

N= [p]\b

S, and b

ΣSS0denotes

the sub-matrix of b

Σncontaining entries indexed by (k, `)∈S×S0. From

(5.9), the multiplier bootstrap approximation of |e

Z|∞is RMB,LLA

n(1, p) =

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 19

D−1/2ZLLA

n|∞, where b

D= diagonal matrix of the sample covariance matrix

of {XLLA

i}n

i=1. Consequently, we reject (1.2) if b

Tn,p > qMB,LLA

α(1, p), where

qMB,LLA

α(1, p) is the (conditional) upper α-quantile of RMB,LLA

n(1, p) given

{Xi}n

i=1.

Remark 5.2.To the best of our knowledge, this is the ﬁrst paper to

consider testing the exogenous assumption (1.2), for which we use the maxi-

mum correlation between covariates and ﬁtted residuals as the test statistic.

A referee kindly informed us in his/her review report that in the context

of speciﬁcation testing, Chernozhukov, Chetverikov and Kato (2013) pro-

pose a similar extreme value statistic and use the multiplier bootstrap to

compute a critical value for the test. To construct marginal test statistics,

they use self-normalized covariances between generated regressors and ﬁtted

residuals obtained via ordinary least squares, whereas we use sample cor-

relations between the covariates and ﬁtted residuals obtained by the LLA

algorithm. We refer readers to Appendix M in the supplementary material

of Chernozhukov, Chetverikov and Kato (2013) for more details.

6. Numerical studies. In this section, Monte Carlo simulations are

used to examine the ﬁnite-sample performance of the bootstrap approxi-

mation (for a given data set) of the distribution of the maximum spurious

correlation (MSC).

6.1. Computation of spurious correlation. First, we observe that b

Rn(s, p)

in (2.2) can be written as b

n(s, p) = bσ−2

εmaxS⊆[p]:|S|=svT

n,S b

Σ−1

SS vn,S , where

bσ2

ε=n−1Pn

i=1(εi−¯εn)2and vn=n−1Pn

i=1(εi−¯εn)(Xi−¯

Xn). Therefore,

the computation of b

Rn(s, p) requires solving the combinatorial optimization

problem

(6.1) b

S= arg max

S⊆[p]:|S|=svT

n,S b

Σ−1

SS vn,S .

It is computationally intensive to obtain b

Sfor large values of pand sas

one essentially needs to enumerate all p

spossible subsets of size sfrom p

covariates. A fast and easily implementable approach is to use the stepwise

addition (forward selection) algorithm as in Fan, Guo and Hao (2012), which

results in some value that is no larger than b

Rn(s, p) but avoids computing all

p

smultiple correlations in (6.1). Note that the optimization (6.1) is equiva-

lent to ﬁnding the best subset regression of size s. When pis relatively small,

say if pranges from 20 to 40, the branch-and-bound procedure is commonly

used for ﬁnding the best subset of a given size that maximizes multiple R2

20 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

[Brusco and Stahl (2005)]. However, this approach becomes computational

infeasible very quickly when there are hundreds or thousands of potential

predictors. As a trade-oﬀ between approximation accuracy and computa-

tional intensity, we propose using a two-step procedure that combines the

stepwise addition and branch-and-bound algorithms. First, we use the for-

ward selection to pick the best dvariables, say d= 40, which serves as a

pre-screening step. Second, across the d

ssubsets of size s, the branch-and-

bound procedure is implemented to select the best subset that maximizes

the multiple-R2. This subset is used as an approximate solution to (6.1).

Note that when s > 40, which is rare in many applications, we only use the

stepwise addition to reduce the computational cost.

6.2. Accuracy of the multiplier bootstrap approximation. For the ﬁrst

simulation, we consider the case where the random noise εfollows the uni-

form distribution standardized so that E(ε) = 0 and E(ε2) = 1. Independent

of ε, the p-variate vector Xof covariates has i.i.d. N(0,1) components. In the

results reported in Table 1, the ambient dimension p= 2000, the sample size

ntakes a value in {400,800,1200}, and stakes a value in {1,2,5,10}. For

a given signiﬁcance level α∈(0,1), let qα(s, p) be the upper α-quantile of

Rn(s, p) in (2.1). For each data set Xn={X1,...,Xn}, a direct application

of Theorems 3.1 and 3.2 is that

cMB(Xn, α) := PRCMB

n(s, p)≥qα(s, p)|Xn→αas n→ ∞.

The diﬀerence cMB(Xn, α)−α, however, characterizes the extent of the size

distortions and the ﬁnite-sample accuracy of the multiplier bootstrap ap-

proximation (MBA). Table 1summarizes the mean and the standard de-

viation (SD) of cMB(Xn, α) based on 200 simulated data sets with α∈

{0.05,0.1}. The α-quantile qα(s, p) is calculated from 1600 replications, and

cMB(Xn, α) for each data set is simulated based on 1600 bootstrap replica-

tions. In addition, we report in Figure 3the distributions of the maximum

spurious correlations and their multiplier bootstrap approximations condi-

tional on a given data set Xnwhen p∈ {2000,5000},s∈ {1,2,5,10}and

n= 400. Together, Table 1and Figure 3show that the multiplier bootstrap

method indeed provides a quite good approximation to the (unknown) dis-

tribution of the maximum spurious correlation.

For the second simulation, we focus on an anisotropic case where the co-

variance matrix Σof Xis non-identity, and the condition number of Σis

well-controlled. Speciﬁcally, we assume that εfollows the centered Laplace

distribution rescaled so that E(ε) = 0 and E(ε2) = 1. To introduce de-

pendence among covariates, ﬁrst we denote with Aa 10 ×10 symmet-

ric positive deﬁnite matrix with a pre-speciﬁed condition number c > 1

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 21

Table 1

(Isotropic case) The mean of 200 empirical sizes cMB(·, α)×100, with its estimate of SD

in the parenthesis, when p= 2000,s= 1,2,5,10,n= 400,800,1200, and α= 0.1,0.05

Trule s= 1 s= 2 s= 5 s= 10

n α = 0.1α= 0.05 α= 0.1α= 0.05 α= 0.1α= 0.05 α= 0.1α= 0.05

400 9.54 4.68 9.13 4.38 9.08 3.78 8.67 4.44

(0.643) (0.294) (0.568) (0.284) (0.480) (0.245) (0.506) (0.291)

800 9.43 4.93 9.47 4.42 9.73 4.73 9.94 5.62

(0.444) (0.296) (0.474) (0.296) (0.488) (0.294) (0.557) (0.331)

1200 9.09 4.32 9.00 4.46 9.42 4.87 9.97 5.15

(0.507) (0.261) (0.542) (0.278) (0.543) (0.322) (0.579) (0.318)

0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6

Density

30 (n,p) = (400,2000)

MSC-1

MBA-1

MSC-2

MBA-2

MSC-5

MBA-5

MSC-10

MBA-10

Maximum Spurious Correlation vs Multiplier Bootstrap Approximation

0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6

Density

40 (n,p) = (400,5000)

Fig 3: Distributions of maximum spurious correlations (blue) and multiplier

bootstrap approximations (for a given data set; red) based on 1600 simula-

tions with combinations of p= 2000,5000, s= 1,2,5,10, and n= 400 when

Σis an identity matrix.

and let ρ∈(0,1). Then the p-dimensional vector Xof the covariates is

22 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

generated according to X=G1(ρ)Z1+G2(ρ)Z2, where Z1d

=N(0,A),

Z2d

=N(0,Ip−10), and G1(ρ)∈Rp×10 ,G2(ρ)∈Rp×(p−10) are given respec-

tively by G1(ρ)T= (I10,ρ

√1+ρ2I10,G11(ρ)T) with

G11(ρ) = 1−ρ

p1 + (1 −ρ)2







1 0 . . . 0

.··· .

1 0 . . . 0





∈R(p−20)×10

and

G2(ρ) = 





010×10 010×(p−20)

√1+ρ2I10 010×(p−20)

0(p−20)×10 1

√1+(1−ρ)2Ip−20





.

In particular, we take c= 5 and ρ= 0.8 in the simulations reported in

Table 2, which summarizes the mean and the standard deviation (SD) of

the size cMB(Xn, α) based on 200 simulated data sets with α∈ {0.05,0.1}.

Comparing the simulation results shown in Tables 1and 2, we ﬁnd that

the bootstrap approximation is fairly robust against heterogeneity in the

covariance structure of the covariates.

Table 2

(Anisotropic case) Mean of 200 empirical sizes cMB(·, α)×100, with its estimate of SD

in the parenthesis, when p= 2000,s= 1,2,5,10,n= 400,800,1200, and α= 0.1,0.05

Trule s= 1 s= 2 s= 5 s= 10

n α = 0.1α= 0.05 α= 0.1α= 0.05 α= 0.1α= 0.05 α= 0.1α= 0.05

400 9.83 4.39 9.04 4.75 9.27 4.65 9.34 4.53

(0.426) (0.222) (0.402) (0.208) (0.492) (0.273) (0.557) (0.291)

800 10.18 5.19 10.48 5.12 9.98 4.86 9.21 4.73

(0.556) (0.296) (0.519) (0.272) (0.576) (0.220) (0.474) (0.269)

1200 9.42 4.41 9.60 5.71 9.90 4.85 10.11 5.19

(0.500) (0.233) (0.543) (0.339) (0.553) (0.333) (0.606) (0.337)

6.3. Detecting spurious discoveries. To examine how the multiplier boot-

strap quantile qCMB

α(s, p) (see Section 5.1) serves as a benchmark for judging

whether the discovery is spurious, we compute the Spurious Discovery Prob-

ability (SDP) by simulating 200 data sets from (4.1) with n= 100,120,160,

p= 400, β∗= (1,0,−0.8,0,0.6,0,−0.4,0,...,0)T∈Rp, and standard Gaus-

sian noise εd

=N(0,1). For some integer s≤r≤p, we let xd

=N(0,Ir) be an

r-dimensional Gaussian random vector. Let Γrbe a p×rmatrix satisfying

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 23

ΓT

rΓr=Ir. The rows of the design matrix Xare sampled as i.i.d. copies from

Γrx∈Rp, where rtakes a value in {120,160,200,240,280,320,360}. To save

space, we give the numerical results for the case of non-Gaussian design and

noise in the supplementary material [Fan, Shao and Zhou (2017)].

Put Y= (Y1, . . . , Yn)Tand let b

Y=Xb

βpLASSO be the n-dimensional

vector of ﬁtted values, where b

βpLASSO = (XT

SXb

S)−1XT

SYis the post-LASSO

estimator using covariates selected by the ten-fold cross-validated LASSO

estimator. Let bs=|b

S|0be the number of variables selected. For α∈(0,1),

the level-αSDP is deﬁned as P{|dcorrn(Y,b

Y)| ≤ qCMB

α(bs, p)}. As the simu-

lated model is not null, this SDP is indeed a type II error. Given α= 5% and

for each simulated data set, qCMB

α(s, p) is computed based on 1000 bootstrap

replications. Then we compute the empirical SDP based on 200 simulations.

The results are given in Table 3.

In this study, the design matrix is chosen so that there is a low-dimensional

linear dependency in the high-dimensional covariates. The collected covari-

ates are highly correlated when ris much smaller than p. It is known that

collinearity and high dimensionality add diﬃculty to the problem of variable

selection and deteriorate the performance of the LASSO. The smaller the

ris, the more severe the problem of collinearity becomes. As reﬂected in

Table 3, the empirical SDP increases as rdecreases, indicating that the cor-

relation between ﬁtted and observed responses is more likely to be smaller

than the spurious correlation.

Table 3

Empirical α-level spurious discovery probability (ESDP) based on 200 simulations when

p= 400,n= 100,120,160, and α= 5%.

Trule r= 120 r= 160 r= 200 r= 240 r= 280 r= 320 r= 360

n= 100 0.6950 0.6650 0.6000 0.5200 0.5100 0.4500 0.4000

n= 120 0.6600 0.5350 0.3350 0.3800 0.2500 0.2850 0.1950

n= 160 0.1950 0.1300 0.0500 0.0400 0.0550 0.0700 0.0250

6.4. Model selection. We demonstrate the idea in Section 5.2 through

the following toy example. Consider the linear model (4.1) with (n, p) =

(160,400) and β∗= (1,0,−0.8,0,0.6,0,−0.4,0,...,0)T∈Rp. The covariate

vector is taken to be X=Γx with x= (x1, . . . , x200)T, where x1, . . . , x200

are i.i.d. random variables following the continuous uniform distribution on

[−1,1] and Γis a 400×200 matrix satisfying ΓTΓ=I200. The noise variable

εfollows a standardized t-distribution with 4 degrees of freedom. Moreover,

let S0={j:β∗

j6= 0}be the true model.

24 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Applying ten-fold cross-validated LASSO selects 35 variables. Along the

solution path, we compute the number of correctly selected variables |b

S∩S0|,

the ﬁtted correlation, and the upper 5%-quantile of the multiplier bootstrap

approximation of the maximum spurious correlation based on 1000 boot-

strap samples. The results are provided in Table 4, from which we see that

the cross-validation procedure under the guidance of MSC selects 15 vari-

ables including all of the signal covariates.

Table 4

Number of true positive results, the sample correlation between ﬁtted and observed

responses, and the upper 5%-quantile of the multiplier bootstrap approximation based on

1000 bootstrap samples.

Trule |

Sλ∩S0|dcorrn(Y,

YpLASSO

λ)qCMB

0.05 (bsλ, p)

λ=0.3410

(bsλ=1) 1 0.3314 0.3040

λ=0.2703

(bsλ=2) 2 0.4802 0.3870

λ=0.2580

(bsλ=3) 3 0.5255 0.4435

λ=0.2351

(bsλ=4) 3 0.5536 0.4907

λ=0.2142

(bsλ=5) 3 0.5791 0.5297

λ=0.2044

(bsλ=6) 3 0.5971 0.5608

λ=0.1952

(bsλ=8) 3 0.6205 0.6131

λ=0.1778

(bsλ=9) 3 0.6377 0.6365

λ=0.1697

(bsλ=11) 4 0.6953 0.6758

λ=0.1620

(bsλ=14) 4 0.7380 0.7208

λ=0.1409

(bsλ=15) 40.7490 0.7346

λ=0.1345

(bsλ=19) 4 0.7685 0.7799

λ=0.0885

(bsλ=35) 4 0.8428 0.8847

6.5. Gene expression data. In this section, we extend the previous study

in Section 6.3 to an analysis of a real life data set. To further address the

question that for a given data set, whether the discoveries based on certain

data-mining technique are any better than spurious correlation, we consider

again the gene expression data from 90 individuals (45 Japanese and 45

Chinese, JPT-CHB) from the international ‘HapMap’ project [Thorisson et

al. (2005)] discussed in the introduction.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 25

The gene CHRNA6 is thought to be related to the activation of dopamine-

releasing neurons with nicotine, and therefore has been the subject of many

nicotine addiction studies [Thorgeirsson et al. (2010)]. We took the expres-

sions of CHRNA6 as the response Yand the remaining p= 47292 expres-

sions of probes as covariates X. For a given λ > 0, LASSO selects bsλprobes

indexed by b

Sλ. In particular, using ten-fold cross-validation to select the

tuning parameter gives bsλ0= 25 probes with λ0= 0.0674. Deﬁne ﬁtted

vectors b

YLASSO

λ=Xb

βLASSO

λand b

YpLASSO

λ=Xb

Sλb

βpLASSO

λ, where b

βLASSO

λis

the LASSO estimator and b

βpLASSO

λis the post-LASSO estimator, which is

the least-square estimator based on the LASSO selected set.

We depict the observed correlations between the ﬁtted value and the

response as well as the median and upper α-quantile of the multiplier boot-

strap approximation with α= 10% based on 1000 bootstrap replications in

Table 5. Even though dcorrn(Y,b

YLASSO)=0.8991 and dcorrn(Y,b

YpLASSO) =

0.9214, the discoveries appear to be no better than chance. We therefore in-

crease λ, which decreases the size of discovered probes. From Table 4, only

the discovery of three probes is above chance results at α= 10%. The three

probes are BBS1 – Homo sapiens Bardet-Biedl syndrome 1, POLE2 – Homo

sapiens polymerase (DNA directed), epsilon 2 (p59 subunit), and TG737 –

Homo sapiens Probe hTg737 (polycystic kidney disease, autosomal reces-

sive), transcript variant 2. Figure 4shows the observed correlations of the

ﬁtted values and observed values compared to the reference null distribution.

We now use the test statistic (5.5) to test whether the null hypothesis (1.2)

holds. We take λ0= 0.0674 and compute the observed test statistic b

Tobs

n,p =

4.6318. This corresponds to √ntimes the maximum correlation presented

in Figure 1. Using the null distribution provided by (4.9), which can be

estimated via the multiplier bootstrap, yields the p-value 0.001. Further,

using the SCAD gives b

Tobs

n,p = 4.1324 and a p-value 0.0164. Both calculations

are based on 5000 bootstrap replications. Therefore, the evidence against

the exogeneity assumption is very strong. Figure 4depicts the observed test

statistics relative to the null distribution.

7. Proofs. We ﬁrst collect several technical lemmas in Section 7.1 be-

fore proving our main result, Theorem 3.1 in Section 3. The proofs of Theo-

rems 3.2,4.1 and 4.2 are given in the supplemental material [Fan, Shao and

Zhou (2017)], where the proofs of Propositions 3.1 and 3.2 and Lemmas 7.2–

7.6 can also be found. Throughout, the letters C, C1, C2, . . . and c, c1, c2, . . .

denote generic positive constants that are independent of (s, p, n), whose

values may change from line to line.

26 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Table 5

Sample correlations between ﬁtted and observed responses, and the empirical median and

upper α-quantile of the multiplier bootstrap approximation based on 1200 bootstrap

samples when α= 10%.

Trule dcorrn(Y,

YLASSO

λ)dcorrn(Y,

YpLASSO

λ)qCMB

0.5(bsλ, p)qCMB

0.1(bsλ, p)

λ=0.1789

(bsλ=2) 0.6813 0.6879 0.5585 0.5988

λ=0.1708

(bsλ=3) 0.6915 0.7010 0.6555 0.6904

λ=0.1630

(bsλ=4) 0.7059 0.7260 0.7252 0.7554

λ=0.1556

(bsλ=5) 0.7141 0.7406 0.7797 0.8044

λ=0.1292

(bsλ=8) 0.7454 0.7641 0.8828 0.8988

λ=0.1177

(bsλ=14) 0.7714 0.8307 0.9658 0.9724

λ=0.1073

(bsλ=17) 0.8026 0.8739 0.9817 0.9860

λ=0.0933

(bsλ=21) 0.8451 0.9019 0.9915 0.9945

λ=0.0891

(bsλ=23) 0.8561 0.9109 0.9937 0.9966

λ=0.0674

(bsλ=25) 0.8991 0.9214 0.9953 0.9979

7.1. Technical lemmas. The following lemma combines Propositions 5.10

and 5.16 in Vershynin (2012).

Lemma 7.1.Let X1, . . . , Xnbe independent centered random variables

and write xn= (X1, . . . , Xn)T∈Rn. Then for every a= (a1, . . . , an)T∈Rn

and every t≥0, we have

(7.1) P|aTxn| ≥ t≤2 exp −cBmin t2

1|a|2

B1|a|∞

and

(7.2) P|aTxn| ≥ t≤2 exp −cH

2|a|2

2,

where Bv= max1≤i≤nkXikψvfor v= 1,2and cB, cH>0are absolute

constants.

Lemma 7.2.Let Conditions 2.1 and 2.2 be fulﬁlled. Write

Dn=Dn(s, p) := sup

α∈V αTb

Σnα/αTΣα−1and bσ2

ε=n−1

i=1

(εi−¯εn)2,

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 27

(a)

0.55 0.6 0.65 0.7 0.75

100

150

200

250

300

350 lambda = 0.1708, s = 3

(b)

0.9 0.92 0.94 0.96 0.98 1

200

400

600

800

1000

1200 lambda = 0.0674, s = 25

(c)

2 3 4 5

500

1000

1500

2000 Validating exogeneity via LASSO

(d)

2 3 4 5

500

1000

1500

2000 Validating exogeneity via SCAD

Fig 4: Top panel: Distributions of the spurious correlation b

Rn(s, p) estimated

by the bootstrap approximation for (a) s= 3 and (b) s= 25 and the sample

correlation between ﬁtted and observed responses (see Table 5). Red solid

lines are observed correlations and blue dash-dot lines mark the 90th per-

centile in (a) the median and (b) the distributions of the median. Bottom

panel: Null distributions for testing exogeneity (1.2) and its 95th percentile

(indicated by dash blue line) using bootstrap approximation (4.9) and ob-

served test statistics b

Tobs

n,p (indicated by solid red line) based on the residuals

of the LASSO and SCAD.

where b

Σn=n−1Pn

i=1(Xi−¯

Xn)(Xi−¯

Xn)Tand Vis as in (3.2). Then,

there exists a constant C1>0such that, for every t≥1,

Dn≤C1K2

1rs

nlog(γsep/s) + max rt

n, cn(s, p)t

n

(7.3)

holds with probability at least 1−8e−t, where cn(s, p) := slog(γsep/s)∨log n.

28 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Moreover, for every t > 0,

(7.4) bσ2

ε−1≤K2

0n−1t+ 4K2

0max n−1/2√t, n−1t

holds with probability greater than 1−2 exp(−cBt)−2 exp(−cHt), where

cB, cH>0are absolute constants as in (7.1)and (7.2).

The following results address the concentration and anti-concentration

phenomena of the supremum of the Gaussian process G∗indexed by F

(see (3.4)). In line with Chernozhukov, Chetverikov and Kato (2013), in-

equalities (7.5) and (7.6) below are referred to as the concentration and

anti-concentration inequalities, respectively.

Lemma 7.3.Let R∗(s, p) = supfα∈F fα(Z)/|α|Σfor F=F(s, p)given

in (3.2)and Zd

=N(0,Σ). Then there exists an absolute constant C > 0

such that, for every p≥2,1≤s≤pand t > 0,

PR∗(s, p)≥Cpslog(γsep/s) + t≤e−t2/2

(7.5)

and sup

x≥0

P|R∗(s, p)−x| ≤ t≤Ctpslog(γsep/s),(7.6)

where γs=pφmax(s)/pφmin(s).

Lemma 7.4.Suppose that a≥1and bj, cj>0for j= 1, . . . , m are pos-

itive constants. Let X1, . . . , Xmbe real-valued random variables that satisfy

P(|Xj| ≥ t)≤aexp{−t2/(2bj)},for t > 0, j = 1, . . . , m.

Then, for all m≥4/a, we have E(max1≤j≤m|Xj|)≤2plog(am) max1≤j≤mbj.

Furthermore, suppose that P(|Xj| ≥ t)≤aexp(−t/cj)holds for all t > 0

and j= 1, . . . , m. Then, for any m≥4/a, we have E(max1≤j≤m|Xj|)≤

{log(am)+1}max1≤j≤mcj.

To save space, we leave the proofs of Lemmas 7.2–7.4 to Appendix Ain

the supplemental material [Fan, Shao and Zhou (2017)].

7.2. Proof of Theorem 3.1.In view of (3.1), we have

Rn(s, p) = sup

α∈V

n−1Pn

i=1 αT(εiXi)−¯εnαT¯

(αTb

Σnα)1/2· {n−1Pn

i=1(εi−¯εn)2}1/2,

where Vis as in (3.2).

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 29

By Lemma 7.2, instead of dealing with b

Rn(s, p) directly, we ﬁrst investi-

gate the asymptotic behavior of its standardized counterpart given by

(7.7) Rn(s, p) = sup

α∈V

n−1

i=1

αT(εiXi)

|α|Σ

= sup

α∈V

n−1

i=1

αT

Σyi,

where yi=εiXi= (Yi1, . . . , Yip)Tare i.i.d. random vectors with mean zero

and covariance matrix Σ. Let Pybe the probability measure on Rpinduced

by y=εX. Further, deﬁne rescaled versions of b

Rn(s, p) and Rn(s, p) as

(7.8) b

Ln=b

Ln(s, p) = √nb

Rn(s, p), Ln=Ln(s, p) = sup

α∈V

n−1/2

i=1

αT

Σyi.

The main strategy is to prove the Gaussian approximation of Lnby the

supremum of a Gaussian process G∗indexed by Fwith covariance function

EG∗fα1G∗fα2=αT

1Σα2

|α1|Σ· |α2|Σ

,α1,α2∈ V.

Let Zbe a p-variate centered Gaussian random vector with covariance ma-

trix Σ. Then the aforementioned Gaussian process G∗can be induced by Z

in the sense that for every α∈ V,G∗fα=αT

ΣZ. The following lemmas show

that, under certain moment conditions, the distribution of Ln=√nRn(s, p)

can be consistently estimated by that of the supremum of the Gaussian pro-

cess G∗, denoted by R∗(s, p) = supα∈V G∗fα, and b

Lnand Lnare close. We

state them ﬁrst in the following two lemmas and prove them in Appendix F

of the supplemental material [Fan, Shao and Zhou (2017)].

Lemma 7.5.Under Conditions 2.1 and 2.2, there exists a random vari-

able T∗=T∗(s, p)d

= supα∈V αT

ΣZfor Zd

=N(0,Σ)such that, for any

δ∈(0, K0K1],

|Ln−T∗|.n−1c1/2

n(s, p) + K0K1n−3/2c2

n(s, p) + δ(7.9)

holds with probability at least 1−C∆n(s, p ;δ), where cn(s, p) = slog(γsep/s)∨

log nand

∆n(s, p ;δ) = (K0K1)3{sbn(s, p)}2

δ3√n+ (K0K1)4{sbn(s, p)}5

δ4n

with bn(s, p) = log(γsp/s)∨log n.

30 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Lemma 7.6.Let Conditions 2.1 and 2.2 hold. Assume that the sample

size satisﬁes n≥C1(K0∨K1)4cn(s, p). Then, with probability at least 1−

C2n−1/2c1/2

n(s, p),

Ln−Ln|.(K0∨K1)2K0K1n−1/2cn(s, p),(7.10)

where cn(s, p) = slog(γsep/s)∨log n.

Let bn(s, p) = log(γsp/s)∨log n. Applying Lemmas 7.5 and 7.6 with

δ=δn(s, p) = (K0K1)3/4min 1, n−1/8{sbn(s, p)}3/8

yields that, with probability at least 1 −C(K0K1)3/4n−1/8{sbn(s, p)}7/8,

Ln−T∗|.(K0K1)3/4n−1/8{sbn(s, p)}3/8.

Together with the inequality (7.6), this proves (3.3).

Further, using (3.2), (3.4) and the identity vTA−1v= maxα∈Ss−1(αTv)2

αTAα

that holds for any s×spositive deﬁnite matrix A, we ﬁnd that with prob-

ability one,

R∗(s, p) = max

S⊆[p]:|S|=smax

α∈Ss−1

αTZS

pαTΣSS α= max

S⊆[p]:|S|=sqZT

SΣ−1

SS ZS,(7.11)

where for each S⊆[p] ﬁxed, the second maximum over αis achieved when

α=Σ−1/2

SS ZS/|Σ−1/2

SS ZS|2, as for each p≥1 ﬁxed, all of the coordinates of

Zare non-zero almost surely. In particular, when Σ=Ip, the right-hand

side of (7.11) is reduced to maxS⊆[p]:|S|=s|ZS|2and therefore, {R∗(s, p)}2=

maxS⊆[p]:|S|=sPj∈SZ2

j=Z2

(p)+···+Z2

(p−s+1) happens with probability one.

This and (3.3) complete the proof of (3.5).

References.

Arlot, S., Blanchard, G. and Roquain, E. (2010). Some nonasymptotic results on

resampling in high dimension. I. Conﬁdence regions. Ann. Statist. 38 51–82.

Barrett, G. F. and Donald, S. G. (2003). Consistent tests for stochastic dominance.

Econometrica 71 71–104.

Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso

and Dantzig selector. Ann. Statist. 37 1705–1732.

B¨

uhlmann, P. and van de Geer, S. A. (2011). Statistics for High-Dimensional Data:

Methods, Theory and Applications. Springer, New York.

Brusco, M. J. and Stahl, S. (2005). Branch-and-Bound Applications in Combinatorial

Data Analysis. Springer, New York.

Cai, T. T. and Jiang, T. (2011). Limiting laws of coherence of random matrices with

applications to testing covariance structure and construction of compressed sensing

matrices. Ann. Statist. 39 1496–1525.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 31

Cai, T. T., Fan, J. and Jiang, T. (2013). Distributions of angles in random packing on

spheres. J. Mach. Learn. Res. 14 1837–1864.

Cai, T. T., Liu, W. and Xia, Y. (2014). Two-sample test of high dimensional means

under dependence. J. R. Stat. Soc. Ser. B 76 349–372.

Chang, J.,Zheng, C.,Zhou, W.-X. and Zhou, W. (2017). Simulation-based hypoth-

esis testing of high dimensional means under covariance heterogeneity. Biometrics. To

appear. DOI: 10.1111/biom.12695.

Chatterjee, S. and Bose, A. (2005). Generalized bootstrap for estimating equations.

Ann. Statist. 33 414–436.

Chernozhukov, V., Chetverikov, D. and Kato, K. (2013). Gaussian approximations

and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann.

Statist. 41 2786–2819.

Chernozhukov, V., Chetverikov, D. and Kato, K. (2014). Gaussian approximation

of suprema of empirical processes. Ann. Statist. 42 1564–1597.

Davydov, Yu. A.,Lifshits, M. A. and Smorodina, N. V. (1998). Local Properties of

Distributions of Stochastic Functionals. American Mathematical Society, Providence,

RI.

Dudoit, S. and van der Laan, M. J. (2007). Multiple Testing Procedures with Applica-

tions to Genomics. Springer, New York.

Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Test-

ing, and Prediction. IMS Monographs Vol 1. Cambridge University Press, Cambridge.

Fan, J., Guo, S. and Hao, N. (2012). Variance estimation using reﬁtted cross-validation

in ultrahigh dimensional regression. J. R. Stat. Soc. Ser. B 74 37–65.

Fan, J., Han, F. and Liu, H. (2014). Challenges of Big Data analysis. National Science

Review 1293–314.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its

oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.

Fan, J. and Liao, Y. (2014). Endogeneity in high dimensions. Ann. Statist. 42 872–917.

Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional

feature space. Statist. Sinica 20 101–148.

Fan, J., Shao, Q.-M. and Zhou, W.-X. (2017). Supplement to “Are discoveries spurious?

Distributions of maximum spurious correlations and their applications.”

Fan, J., Xue, L. and Zou, H. (2014). Strong oracle optimality of folded concave penalized

estimation. Ann. Statist. 42 819–849.

Goeman, J. J., van de Geer, S. A. and van Houwelingen, H. C. (2006). Testing

against a high dimensional alternative. J. R. Stat. Soc. Ser. B 68 477–493.

Hansen, B. E. (1996). Inference when a nuisance parameter is not identiﬁed under the

null hypothesis. Econometrica 64 413–430.

Hastie, T. J., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical

Learning: Data Mining, Inference, and Prediction (2nd ed). Springer, New York.

Shao, Q.-M. and Zhou, W.-X. (2014). Necessary and suﬃcient conditions for the

asymptotic distributions of coherence of ultra-high dimensional random matrices. Ann.

Probab. 42 623–648.

Stranger, B. E., Nica, A. C., Forrest, M. S., Dimas, A., Bird, C. P., Beazley, C.,

Ingle, C. E., Dunning, M., Flicek, P., Koller, D., Montgomery, S., Tavar´

e, S.,

Deloukas, P. and Dermitzakis, E. T. (2007). Population genomics of human gene

expression. Nature Genetics 39 1217–1224.

Thorgeirsson, T. E., et al. (2010). Sequence variants at CHRNB3-CHRNA6 and

CYP2A6 aﬀect smoking behavior. Nature Genetics 42 448–453.

Thorisson, G. A., Smith, A. V., Krishnan, L. and Stein, L. D. (2005). The Interna-

32 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

tional HapMap Project Web site. Genome Research 15 1592–1593.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc.

Ser. B 58 267–288.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical

Processes: With Applications to Statistics. Springer, New York.

Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices.

In Compressed Sensing: Theory and Applications (Y. Eldar and G. Kutyniok, eds.)

210–268. Cambridge University Press, Cambridge.

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty.

Ann. Statist. 38 2109–2144.

Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood

models. Ann. Statist. 36 1509–1533.

Submitted to the Annals of Statistics

SUPPLEMENT TO “ARE DISCOVERIES SPURIOUS?

DISTRIBUTIONS OF MAXIMUM SPURIOUS

CORRELATIONS AND THEIR APPLICATIONS”

By Jianqing Fank,∗∗, Qi-Man Shao†† and Wen-Xin Zhouk,‡‡

Princeton Universityk, Academy of Mathematics and Systems Science∗∗,

Chinese University of Hong Kong††

and University of California, San Diego‡‡

This supplemental material contains additional proofs for all the

remaining theoretical results in the main text, including Lemmas 7.2–

7.6, Theorems 3.2,4.1 and 4.2, and Propositions 3.1 and 3.2. A dis-

cussion on the moment assumptions is also included.

APPENDIX A: PROOF OF LEMMAS 7.2–7.6

Here we prove Lemmas 7.2–7.6 in Fan, Shao and Zhou (2015).

A.1. Proof of Lemma 7.2.For every α∈ V, recall that αΣ=α/|α|Σ

with |α|Σ= (αTΣα)1/2. Then, we have

αTb

Σα

αTΣα=n−1

i=1

(αT

ΣXi)2−(αT

Σ¯

Xn)2.

In view of this identity, we deﬁne

Dn,1= sup

α∈V n−1

i=1

(αT

ΣXi)2−1, Dn,2= sup

α∈V

(αT

Σ¯

Xn)2,

such that Dn≤Dn,1+Dn,2. In what follows, we bound the two terms Dn,1

and Dn,2respectively.

Let Gbe a class of functions Rp7→ Rgiven by G={x7→ gα(x) =

hαΣ,xi2:α∈ V}, and denote by PXthe probability measure on Rpinduced

by X. In this notation, we have Dn,1= supg∈G |n−1Pn

i=1 g(Xi)−PXg|. To

bound Dn,1, we follow a standard procedure: ﬁrst we show concentration of

Dn,1around its expectation EDn,1, and then upper bound the expectation.

To prove concentration, applying Theorem 4 in Adamczak (2008) implies

that there exists an absolute constant C > 0 such that, for every t > 0,

(A.1) Dn,1≤2EDn,1+ max 2σG

√t

n, C t

n



max

1≤i≤nsup

g∈G |g(Xi)|



ψ1

34 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

holds with probability at least 1−4e−t, where σ2

G:= supg∈G Pn

i=1 PXig2. Un-

der Condition 2.1, it follows from the fact |Σ1/2αΣ|2= 1 and the deﬁnition

of k·kψ2that

σ2

G≤nsup

α∈V

E{(Σ1/2αΣ)TU}4≤16nsup

α∈Sp−1kαTUk4

ψ2≤16K4

1n,(A.2)

which further leads to σG≤4K2

1√nfor K1as in Condition 2.1. In the last

term of (A.1), note that

sup

g∈G |g(Xi)|= sup

α∈V

(αT

ΣXi)2=sup

α∈V

αT

ΣXi2

For every ∈(0, γ−1/2

s), a standard argument can be used to prove that

there exists an -net Nof Vsuch that d=|N|≤{(2 + )ep/(s)}sand

(A.3) sup

α∈V

αT

ΣXi≤(1 −γs)−1max

α∈N

αT

ΣXi.

See, for example, the proof of (A.14) below. In particular, under Condi-

tion 2.1, using Lemma 2.2.2 in van der Vaart and Wellner (1996) implies by

taking s= (4γs)−1and N=Nsthat



max

1≤i≤nsup

g∈G |g(Xi)|



ψ1

=



max

1≤i≤nsup

α∈V{(Σ1/2αΣ)TUi}2



ψ1

.



max

1≤i≤nmax

α∈N{(Σ1/2αΣ)TUi}2



ψ1

.{slog(γsep/s)∨log n}sup

α∈Sp−1kαTUik2

ψ2

.K2

1cn(s, p),(A.4)

where cn(s, p) = slog(γsep/s)∨log n. Consequently, combining (A.1), (A.2)

and (A.4) yields, with probability at least 1 −4e−t,

(A.5) Dn,1≤2EDn,1+CK2

1max rt

n, cn(s, p)t

n.

To bound the expectation EDn,1, we use a result that involves the generic

chaining complexity, γm(T , d), of a semi-metric space (T, d). We refer to Ta-

lagrand (2005) for a systematic introduction. A tight upper bound for EDn,1

can be obtained by a direct application of Theorem A in Mendelson (2010).

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 35

To this end, note that supα∈V kαT

ΣXikψ1= supα∈V k(Σ1/2αΣ)TUikψ1≤K1

and for every α,α0∈ V,

k(αΣ−α0

Σ)TXikψ2=



{Σ1/2(αΣ−α0

Σ)}TUi



ψ2≤K1|αΣ−α0

Σ|Σ.

Successively, it follows from Theorem A in Mendelson (2010) and Theo-

rems 1.3.6, 2.1.1 in Talagrand (2005) that

EDn,1.K2

1γ2(F,|·|Σ)

√n+γ2

2(F,|·|Σ)

n.K2

1M(s, p)

√n+M2(s, p)

n,

(A.6)

where M(s, p) := E(supα∈V αT

ΣZ) with Zd

=N(0,Σ). In addition, a similar

argument to that leading to (A.3) can be used to show that

M(s, p)≤4

3Emax

α∈N αT

ΣZ≤2plog(|N|).pslog(γsep/s).(A.7)

Next we study Dn,2. Observe that pDn,2= supα∈V |n−1Pn

i=1 αT

ΣXi|.

Again, we use a concentration inequality due to Adamczak (2008). Theo-

rem 4 there implies that, for every t≥0,

(A.8) pDn,2≤2EpDn,2+ max 2σV

√t

n, C t

n



max

1≤i≤nsup

α∈V |αT

ΣXi|



ψ1

with probability at least 1 −4e−t, where σ2

V= supα∈V Pn

i=1 E(αT

ΣXi)2=n.

Under Condition 2.1, supα∈V kαT

ΣXikψ1≤supα∈V kαT

ΣXikψ2≤K1. Recall

that s= (4γs)−1, a similar argument to that leading to (A.4) gives



max

1≤i≤nsup

α∈V |αT

ΣXi|



ψ1

.



max

1≤i≤nsup

α∈N (Σ1/2αΣ)TUi



ψ2

.c1/2

n(s, p) sup

α∈Sp−1kαTUikψ2.K1c1/2

n(s, p).(A.9)

For the expectation EpDn,2, it follows from (A.3) with s= (4γs)−1that

EpDn,2=Esup

α∈V

αT

Σ¯

Xn≤4

3Emax

α∈N αT

Σ¯

Xn.

For α∈ V and t > 0, a direct consequence of (7.2) is that P(|αT

Σ¯

Xn| ≥ t)≤

2 exp(−cHnt2/K2

1). This, together with Lemma 7.4 and the previous display

implies

(A.10) EpDn,2.K1p(s/n) log(γsep/s).

36 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Together, (A.5)–(A.10) completes the proof of (7.3).

Finally, to prove (7.4), note that |bσ2

ε−1| ≤ |n−1Pn

i=1 ε2

i−1|+ ¯ε2

n. For

t1, t2≥0, applying (7.2) and (7.1) gives P(|¯εn| ≥ t1)≤2 exp(−cHnt2

1/K2

and P(|n−1Pn

i=1 ε2

i−1| ≥ t2)≤2 exp{−cBnmin(t2

2/A2, t2/A)}, respectively,

where A=kε2−1kψ1≤2kε2kψ1≤4kεk2

ψ2= 4K2

0. Consequently, taking

t1=K0n−1/2√tand t2= 4K2

0max(n−1/2√t, n−1t) proves (7.4).

A.2. Proof of Lemma 7.3. By (2.5), we have R∗(s, p) = supα∈V αT

ΣZ

and for every α∈ V,E(αT

ΣZ) = 0 and E(αT

ΣZ)2= 1. Consequently, in

view of (A.7), inequalities (7.5) and (7.6) follow from Borell’s inequality

(Proposition A.2.1 in van der Vaart and Wellner (1996)) and Lemma A.1 of

Chernozhukov, Chetverikov and Kato (2014a), respectively.

A.3. Proof of Lemma 7.4. Put B= max1≤j≤mbj. For any T > 0, we

have

Emax

1≤j≤m|Xj|=Z∞

Pmax

1≤j≤p|Xj|> tdt

≤T+Z∞

Pmax

1≤j≤m|Xj|> tdt

≤T+a

j=1 pbj·Z∞

T/√bj

exp(−t2/2) dt

≤T+a√Bm min pπ/2,√B/T exp{−T2/(2B)}.

In particular, this implies by taking T=p2Blog(am)≥√2Bthat

Emax

1≤j≤m|Xj|≤√Bp2 log(am) + {2 log(am)}−1/2

≤√2 + 1

√2 log 4 pBlog(am).

A completely analogous argument will lead to the desired bound under the

condition that P(|Xj| ≥ t)≤aexp(−t/cj) for all t > 0 and j= 1, . . . , m.

A.4. Proof of Lemma 7.5. Recall that Zd

=N(0,Σ) and write Wn=

n−1/2Pn

i=1 yiwith yi=εiXi, such that Ln= supα∈V hαΣ,Wnifor Lnas

in (7.8).

To prove (7.9), a new coupling inequality for maxima of sums of random

vectors in Chernozhukov, Chetverikov and Kato (2014a) plays an impor-

tant role in our analysis. We divide the proof into three steps. First we

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 37

discretize the index space V=V(s, p) using a net, Vε, via a standard cov-

ering argument. Then we apply the aforementioned coupling inequality to

the discretized process, and ﬁnish the proof based on the concentration and

anti-concentration inequalities for Gaussian processes.

Step 1: Discretization. The goal is to establish (A.14), which approxi-

mates the supremum over an inﬁnite index space Vby the maximum over

its -net V.

Let Rpbe equipped with the Euclidean metric ρ(x,y) = |x−y|2for x,y∈

Rp. Subsequently, the induced metric on the space of all linear functions

x7→ fα(x) = hα,xiis deﬁned as ρ(fα, fβ) = supx∈Sp−1|fα(x)−fβ(x)|=

supx∈Sp−1|hα−β,xi| =|α−β|2. For every ∈(0,1), denote by N(V, ρ, )

the -covering number of (V, ρ). For the unit Euclidean sphere Sp−1equipped

with the Euclidean metric ρ, it is well-known that N(Sp−1, ρ, )≤(1+ 2/)p.

Together with the decomposition

(A.11) α∈Sp−1:|α|0=s=[

S⊆[p]:|S|=sα∈Sp−1: supp(α) = S

and the binomial coeﬃcient bound p

s≤(ep/s)s, this yields

(A.12) N(V, ρ, )≤p

s(1 + 2/)s≤ {(2 + )ep/(s)}s.

For ∈(0,1) and S⊆[p] ﬁxed, let NS, be an -net of the unit ball in

(RS, ρ) with |NS,| ≤ (1 + 2/)s. Thus the function class N:= ∪S⊆[p]NS, =

{x7→ hα,xi:α∈ NS,, S ⊆[p]}forms an -net of (V, ρ). Denote by d=

d=|N|the cardinality of N. Then it is easy to see that d≤p

s(1 + 2/)s

and hence log d.slog{ep/(s)}.

For every α∈ V with supp(α) = S, there exists some α0∈ NS, satisfying

that supp(α0) = supp(α) and |α−α0|2≤. Further, note that

|αΣ−α0

Σ|2

Σ= 2 −2hα,Σα0i

|α|Σ|α0|Σ

=hα−α0,Σ(α−α0)i − (|α|Σ− |α0|Σ)2

|α|Σ|α0|Σ≤γ2

s|α−α0|2

2,(A.13)

from which we obtain

hαΣ,Wni=αΣ−α0

Σ,Wn+α0

Σ,Wn

=|αΣ−α0

Σ|Σh(αΣ−α0

Σ)/|αΣ−α0

Σ|2,Wni

|(αΣ−α0

Σ)/|αΣ−α0

Σ|2|Σ

+α0

Σ,Wn

≤γssup

α∈V hαΣ,Wni+α0

Σ,Wn

38 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

for γsas in (2.4), and hence

sup

α∈Sp−1: supp(α)=ShαΣ,Wni ≤ γssup

α∈V hαΣ,Wni+ max

α∈NS, hαΣ,Wni.

Taking maximum over S⊆[p] with |S|=son both sides yields

sup

α∈V hαΣ,Wni ≤ γssup

α∈V hαΣ,Wni+ max

α∈NhαΣ,Wni.

Therefore, as long as ∈(0, γ−1

s),

(A.14) max

α∈NhαΣ,Wni ≤ Ln≤(1 −γs)−1max

α∈NhαΣ,Wni.

Step 2: Coupling. This aims to carry the Gaussian approximation over

the discrete index set Vand to establish (A.15) or its more explicit bound

(A.20).

Write V={αj:j= 1, . . . , d}and let V1,...,Vnbe i.i.d. d-variate ran-

dom vectors such that Vi= (Vi1, . . . , Vid)T, where Vij =αT

j,Σyisatisﬁes

that E(Vij) = 0 and E(V2

ij) = 1. Deﬁne the d-variate Gaussian random vec-

tor G= (G1, . . . , Gd)T, where Gj=αT

j,ΣZ=αT

jZ/|αj|Σfor j= 1, . . . , d.

Note that, for each 1 ≤j6=`≤d,E(V1jV1`) = E(GjG`). By Corollary 4.1 of

Chernozhukov, Chetverikov and Kato (2014a), there exists a random vari-

able T∗



= max1≤j≤dGjsuch that, for every δ > 0,

Pmax

1≤j≤dn−1/2

i=1

Vij −T∗

≥16δ

.B1

log(dn)

δ2n+B2{log(dn)}2

δ3n3/2+B3{log(dn)}3

δ4n2+log n

n,(A.15)

where

B1=Emax

1≤j,`≤d

i=1 VijVi` −EVij Vi`,

B2=Emax

1≤j≤d

i=1 |Vij|3, B3=

i=1

Emax

1≤j≤dV4

ij.

In what follows, we bound the three terms B1–B3respectively.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 39

First, by Lemma 2.2.2 in van der Vaart and Wellner (1996) we have

Emax

1≤j≤dV4

ij=E(ε4

i)·Emax

1≤j≤d(αT

j,ΣXi)4

≤16E(ε4

i)·



max

1≤j≤d(Σ1/2αj,Σ)TUi



ψ2

.v4(log d)2·sup

α∈Sp−1kαTUik4

ψ2

.v4K4

1(log d)2

for v4=E(ε4) as in Condition 2.1, leading to

(A.16) B3.v4K4

1n[slog{ep/(s)}]2.

For B2, we apply Lemma 9 in Chernozhukov, Chetverikov and Kato (2015)

to obtain

B2.max

1≤j≤d

i=1

E|Vij|3+ (log d)·Emax

1≤i≤nmax

1≤j≤d|Vij|3.

For every integer q≥1, by the deﬁnition of the k·kψ2norm we have

E|Vij|q=vqE(Σ1/2αj,Σ)TUiq≤qq/2vq

(Σ1/2αj,Σ)TUi

q

ψ2≤qq/2vqKq

and once again, it follows from Lemma 2.2.2 in van der Vaart and Wellner

(1996) that

Emax

1≤i≤nmax

1≤j≤d|Vij|q≤qq



max

1≤i≤nmax

1≤j≤d|εi|·|αT

j,ΣXi|



ψ1

.qq{log(dn)}qmax

1≤j≤d



ε·(Σ1/2αj,Σ)TU



q

ψ1

.qq{log(dn)}qkεkq

ψ2max

1≤j≤d



(Σ1/2αj,Σ)TU



q

ψ2

.qq(K0K1)q{log(dn)}q.

The last three displays together imply by taking q= 3 that

(A.17) B2.v3K3

1n+ (K0K1)3[slog{ep/(s)}]4.

Turning to B1, a direct consequence of Lemma 1 in Chernozhukov, Chetverikov

and Kato (2015) is that

B1.(log d)1/2max

1≤j≤dn

i=1

EV4

ij1/2

+ (log d)·Emax

1≤i≤n,1≤j≤dV4

ij1/2

.v1/2

4K2

1[ns log{ep/(s)}]1/2+ (K0K1)2[slog{ep/(s)} ∨ log n]3.(A.18)

40 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Putting (A.15)–(A.18) together, we obtain that for every δ > 0 and ∈

(0,1),

Pmax

α∈N

αT

ΣWn−T∗

≥16δ

.v1/2

4K2

c3/2

n(s, p, )

δ2√n+v3K3

n(s, p, )

δ3√n+v4K4

n(s, p, )

δ4n

+ (K0K1)2c4

n(s, p, )

δ2n+ (K0K1)3c6

n(s, p, )

δ3n3/2+log n

n,(A.19)

where cn(s, p, ) := slog{ep/(s)} ∨ log n. Because this upper bound is only

meaningful when it is less than 1, it can be further reduced to

bn(s, p, , δ) := (K0K1)3c2

n(s, p, )

δ3√n+ (K0K1)4c5

n(s, p, )

δ4n

(A.20)

for ∈(0,1) and δ∈(0, K0K1].

Step 3. For every ∈(0,1), put s=γs. A similar argument to that

leading to (A.14) now gives

max

α∈N

αT

ΣZ≤sup

α∈V

αT

ΣZ≤ssup

α∈V

αT

ΣZ+ max

α∈N

αT

ΣZ.

Further, it is concluded from (7.5) and (A.30) in the proof of Lemma 7.6

that, with probability at least 1 −Cn−1,

sup

α∈V

αT

ΣZ−max

α∈N

αT

ΣZ.c1/2

n(s, p)s

(A.21)

with cn(s, p) = slog(γsep/s)∨log n, and

sup

α∈V

αT

ΣWn−max

α∈N

αT

ΣWn

.c1/2

n(s, p) + K0K1n−1/2c2

n(s, p)s.(A.22)

For the Gaussian maxima supα∈V hαΣ,Ziand maxα∈NhαΣ,Zi, it follows

from (A.21) that for any Borel subset Bof R,

PT∗

∈ B≤Psup

α∈V

αT

ΣZ∈ BCsc1/2

n(s,p)+n−1,

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 41

where Bu:= {x∈R:|x−y| ≤ u, ∀y∈ B} for u > 0. This, together

with Lemma 4.1 in Chernozhukov, Chetverikov and Kato (2014a), a vari-

ant of Strassen’s theorem, implies that there exits a random variable T∗d

supα∈V hαΣ,Zisuch that

(A.23) PT∗−T∗

> C sc1/2

n(s, p)≤n−1.

Finally, assembling (A.19)–(A.22) completes the proof of (7.9) by taking

= (γsn)−1.

A.5. Proof of Lemma 7.6. Let Sn=n−1Pn

i=1 XiXT

iand write

(A.24) Dn= sup

α∈V αT

Σb

ΣnαΣ−1= sup

α∈V αT

ΣSnαΣ−1−αT

Σ¯

Xn2.

For ease of exposition, deﬁne bσ2

ε=n−1Pn

i=1(εi−¯εn)2and for α∈Rp, let

(A.25) bσ2

α=bσ2

α(Σ) = n−1

i=1

(αT

ΣXi)2−(αT

Σ¯

Xn)2.

In this notation, Dn= supα∈V |bσ2

α−1|and

Ln= sup

α∈V

(bσεbσα)−1n−1/2

i=1

εi·αT

ΣXi−√n¯εnαT

Σ¯

Xn.

Comparing this with Lnin (7.8), it is easy to see that

(A.26) |b

Ln−Ln| ≤ sup

α∈V (bσεbσα)−1−1·Ln+√nbσ−1

ε|¯εn| · sup

α∈V bσ−1

α|αT

Σ¯

Xn|.

In what follows, we bound the two terms on the right-hand side of (A.26)

separately, starting with the ﬁrst one.

For every t > 0, let EX(t) and Eε(t) be the events that (7.3) and (7.4)

hold, respectively. In particular, taking t1=A1log nfor A1>0 and t2=

min[log n, {n/cn(s, p)}1/2] yields P{Eε(t1)c} ≤ 2n−cHA1+ 2n−cBA1and

PEX(t2)c≤8 exp − {n/cn(s, p)}1/2≤3n−1/2c1/2

n(s, p),

where cH, cB>0 are as in Lemma 7.1. Here, the last step comes from the

inequality supt≥0te−t≤e−1. On the event Eε(t1)∩ EX(t2),

bσ2

ε−1.K2

0rlog n

n≤1

2, Dn.K2

1rcn(s, p)

n≤1

(A.27)

42 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

whenever the sample size nsatisﬁes n&max{K4

0log n, K4

1cn(s, p)}. To-

gether, (A.26), (A.27) and the identity

1−(bσεbσα)−1

= (bσεbσα)−1(bσε−1)(bσα−1) + bσε−1 + bσα−1

=(bσ2

ε−1)(bσ2

α−1) + (bσ2

ε−1)(bσα+ 1) + (bσ2

α−1)(bσε+ 1)

bσεbσα(bσε+ 1)(bσα+ 1)

imply, on Eε(t1)∩ EX(t2) with nsuﬃciently large,

sup

α∈V (bσεbσα)−1−1.(K0∨K1)2n−1/2c1/2

n(s, p).(A.28)

Next we deal with Ln, which can be written as supα∈V n−1/2Pn

i=1 αT

Σyi,

where yi=εiXisatisﬁes that, under Condition 2.1,

(A.29) E(αT

Σyi)2= 1 for all α∈ V and sup

α∈V kαT

Σyikψ1≤2K0K1.

As in the proof of (A.8), using Theorem 4 in Adamczak (2008) gives, for

any t≥0,

(A.30) Ln≤2ELn+ max 2√t, C t

√n



max

1≤i≤nsup

α∈V |αT

Σyi|



ψ1

holds with probability at least 1−4e−t. For the last term of (A.30), a similar

argument to that leading to (A.8) gives, on this occasion with s= (4γs)−1

and N=Nsthat



max

1≤i≤nsup

α∈V |αT

Σyi|



ψ1

=



max

1≤i≤nsup

α∈V εi·(Σ1/2αΣ)TUi



ψ1

.



max

1≤i≤nsup

α∈N εi·(Σ1/2αΣ)TUi



ψ1

.{slog(γsp/s) + log n}kεkψ2sup

α∈Sp−1kαTUikψ2

.K0K1cn(s, p).

Here, we used the property that the cardinality of the s-net Nof V, denoted

by d=|N|, is such that log d.slog(γsep/s).

To bound ELn, observe that for every u∈Rp, supα∈V |αT

Σu|= supα∈V αT

Σu.

For Wn=n−1/2Pn

i=1 yi, it follows from (A.14) with s= (4γs)−1that

ELn=Esup

α∈V

αT

ΣWn≤4

3Emax

α∈N αT

ΣWn.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 43

For every α∈ V and t > 0, from (7.1) and (A.29) we get

PαT

ΣWn≥t≤2 exp −cBmin t2

4K2

0K2

,√nt

2K0K1.(A.31)

Hence, applying Lemma 7.3 with slight modiﬁcation gives

Emax

α∈N αT

ΣWn.K0K1plog d+K0K1n−1/2log d.

Plugging this into (A.30) and taking t= min[log n, {n/cn(s, p)}1/2] imply,

with probability at least 1 −4 exp[−{n/cn(s, p)}1/2]≥1−2n−1/2c1/2

n(s, p),

(A.32) Ln.K0K1c1/2

n(s, p)

whenever the sample size nsatisﬁes n&cn(s, p).

Again, on the event Eε(t1)∩ EX(t2) with nsuﬃciently large as above for

(A.27), the second term on the right-hand side of (A.26) is bounded by some

multiple of √nbσ−1

ε|¯εn|pDn,2, where Dn,2is as in (A.8). Arguments similar

to those in the proof of Lemma 7.2 permit us to show that, with probability

at least 1 −2n−1/2c1/2

n(s, p),

(A.33) pDn,2.K1n−1/2c1/2

n(s, p).

Further, put Sn=Pn

i=1 εi, V 2

n=Pn

i=1 ε2

i, such that √nbσ−1

ε¯εn=V−1

nSn{1−

n−1(Sn/Vn)2}1/2. Then it follows from Theorems 2.16 and 2.19 in de la Pe˜na,

Lai and Shao (2009) that for every t∈(0,√n]

P√nbσ−1

ε|¯εn| ≥ t

≤P|Sn| ≥ t(1 + t2n−1)−1/2Vn

≤P|Sn| ≥ t(1 + t2n−1)−1/2(4√2 + 1)−1(4√n+Vn)+PV2

n≤n/2

≤4 exp(−cSNt2) + exp{−n/(8v4)},

where v4=E(ε4) and cSN >0 is an absolute constant. In particular, taking

t=A2(log n)1/2with A2>0 yields, with probability greater than 1 −

4n−cSNA2

2−exp{−n/(8v4)},

(A.34) √nbσ−1

ε|¯εn|.plog n.

Finally, combing (A.26), (A.28), (A.32), (A.33) and (A.34) completes the

proof of (7.10).

44 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

APPENDIX B: PROOF OF THEOREM 3.2

We divide the proof into three key steps. The ﬁrst step is to establish

(B.1) using the results on discretization in the proof of Theorem 3.1, and

then analyze separately the order of the stochastic terms (B.2) and (B.3).

Step 1. For any α∈Rp, let kαk2

n=αTb

Σαwith b

Σ=b

Σn, and let γs=

γs(Σ) be as in (2.4). First, we prove that there exits a (γsn)−1-net of V,

denoted by Vn=Vn(s, p), such that log(|Vn|).sbn(s, p) and

supt≥0Psupα∈V hαΣ,Zi ≤ t−Psupα∈V hαn,Zni ≤ tXn

.hbγs(γsn)−1s¯

bn(s, p)+∆1/3

ns¯

bn(s, p) + log(1/∆n)2/3i,(B.1)

where αn=α/kαkn,Xn={Xi}n

i=1,bn(s, p) = log(γsp/s)∨log n,¯

bn(s, p) =

log(¯γsp/s)∨log nwith ¯γs= max(γs,bγs), and

(B.2) bγs:= γs(b

Σ) = maxu∈Sp−1:1≤|u|0≤skukn

minu∈Sp−1:1≤|u|0≤skukn

denotes the s-sparse condition number of b

Σand

(B.3) ∆n= ∆n(s, p) = max

α,β∈VnαT

ΣΣβΣ−αT

Σβn

with αn=α/kαknand βn=β/kβkn.

Proof of (B.1). As in the proof of Lemma 7.5 in Appendix A.4, for every ∈

(0,1), there exists an -net Nof Vsatisfying d=|N|≤{(2 + )ep/(s)}s,

such that

sup

α∈V hαΣ,Zi − max

α∈NhαΣ,Zi≤γssup

α∈V hαΣ,Zi,(B.4)

sup

α∈V hαn,Zni − max

α∈Nhαn,Zni≤bγssup

α∈V hαn,Zni.(B.5)

For notational convenience, write d=d,N={α1,...,αd}and let

G= (G1, . . . , Gd)T= (hα1,Σ,Zi,...,hαd,Σ,Zi)T,

Gn= (Gn1, . . . , Gnd)T= (hα1,n,Zni,...,hαd,n,Zni)T

be two d-dimensional centered Gaussian random vectors. Conditional on Xn,

applying Theorem 2 in Chernozhukov, Chetverikov and Kato (2015) to G

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 45

and Gnrespectively gives

sup

t∈RPmax

1≤j≤dGj≤t−Pmax

1≤j≤dGnj ≤tXn

.(∆ log d)1/3log d+ log(1/∆)1/3,

where ∆ = ∆(N) = maxα,β∈NαT

ΣΣβΣ−αT

Σβn.

By Lemma 7.3, we have for every t > 0,

Psup

α∈V hαΣ,Zi ≥ Cpslog(γsep/s) + t≤e−t2/2,

Psup

α∈V hαn,Zni ≥ Cpslog(bγsep/s) + tXn≤e−t2/2.

The last three displays, together with (B.4) imply that, for every ∈(0,1)

and t > 0,

Psup

α∈V hαn,Zni ≤ tXn

≤Pmax

α∈Nhαn,Zni ≤ tXn

≤Pmax

α∈NhαΣ,Zi ≤ t+C∆1/3(log d)1/3{log(d/∆)}1/3

≤Psup

α∈V hαΣ,Zi ≤ t+Cγsc1/2

n(s, p)

+C∆1/3(log d)1/3{log(d/∆)}1/3+n−1

≤Psup

α∈V hαΣ,Zi ≤ t+Cγscn(s, p)

+C∆1/3(log d)1/3{log(d/∆)}1/3+n−1,

where cn(s, p) = slog(γsep/s)∨log nand the last inequality comes from

(7.6). For the lower bound, in view of (B.5), it can be similarly obtained

that

Psup

α∈V hαn,Zni ≤ tXn

≥Pmax

α∈Nhαn,Zni ≤ t−Cbγspslog(bγsep/s)∨log nXn−n−1

≥Psup

α∈V hαΣ,Zi ≤ t−Cbγs{slog(bγsep/s)∨log n}

−C∆1/2(log d)1/3{log(d/∆)}1/3−n−1.

46 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Taking Vn=Nwith = (γsn)−1proves (B.1).

Step 2. Next, we study ∆nin (B.3), which is bounded by

max

α,β∈VnαT

Σ(b

Σ−Σ)βΣ+ max

α,β∈VnαT

Σβn−αT

Σb

ΣβΣ:= ∆n,1+ ∆n,2.

(B.6)

In what follows, we bound the two terms on the right side separately, starting

with ∆n,2. For every α,β∈ Vn,

αT

Σβn−αT

Σb

ΣβΣ

=αT

Σ(βn−βΣ) + αT

ΣβΣ−αT

Σb

ΣβΣ

≤ kαnknkβn−βΣkn+kβΣ−βnknkαΣ−αnkn+kβnknkαn−αΣkn

≤kαΣkn−1+kβΣkn−1+kαΣkn−1·kβΣkn−1.

Using this together with Lemma 7.2 yields, with probability at least 1 −

3n−1/2c1/2

n(s, p),

(B.7) ∆n,2.K2

1n−1/2c1/2

n(s, p).

Turning to ∆n,1, it suﬃces to focus on

(B.8)

max

α,β∈VnαT

Σ(Sn−Σ)βΣ= max

α,β∈Vnn−1

i=1 (αT

ΣXi)(βT

ΣXi)−αT

ΣΣβΣ.

Applying Theorem 4 in Adamczak (2008) we obtain that, with probability

at least 1 −4e−t,

max

α,β∈VnαT

Σ(Sn−Σ)βΣ

≤2Emax

α,β∈VnαT

Σ(Sn−Σ)βΣ

+ max 2σVn

√t

n+Ct

n



max

1≤i≤nmax

α,β∈Vn

αT

ΣXiXT

iβΣ



ψ1,(B.9)

where similarly to (A.2) and (A.4),

σ2

Vn:= max

α,β∈Vn

i=1

E(αT

ΣXiβT

ΣXi)2

≤max

α,β∈Vn

i=1 E(αT

ΣXi)41/2E(βT

ΣXi)41/2≤16K4

1n(B.10)

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 47

and

(B.11) 



max

1≤i≤nmax

α,β∈Vn

αT

ΣXiXT

iβΣ



ψ1

.K2

1cn(s, p).

From the moment inequality kαT

ΣXiβT

ΣXikψ1≤2K2

1, another consequence

of (7.1) is that

Pn−1

i=1 (αT

ΣXi)(βT

ΣXi)−αT

ΣΣβΣ≥t

≤2 exp{−cBnmin(t2/K4

1, t/K2

1)}.

Using this together with Lemma 7.4 we get

(B.12) Emax

α,β∈VnαT

Σ(Sn−Σ)βΣ.K2

1rlog(|Vn|)

n+K2

log(|Vn|)

Consequently, combining (B.8)–(B.12) gives, with probability at least 1 −

2n−1/2c1/2

n(s, p),

(B.13) ∆n,1.K2

1n−1/2{sbn(s, p)}1/2.

Together, (B.6), (B.7) and (B.13) imply that, with probability at least 1 −

5n−1/2c1/2

n(s, p),

(B.14) ∆n.K2

1n−1/2{sbn(s, p)}1/2.

Step 3. Finally, we study the sample s-sparse condition number bγsin (B.2).

For every u∈Rp, note that (kukn/|u|Σ)2=uT

Σb

ΣuΣ= 1 + uT

Σb

ΣuΣ−1. For

every ∈(0,1), in view of the inequality Ps

j=1 p

j≤(ep/s)sthat holds for

all 1 ≤s≤p, there exists an -net of {x7→ hu,xi:u∈Sp−1,1≤ |u|0≤s}

with its cardinality bounded by {(2 + )ep/(s)}s. Consequently, it follows

from Lemma 7.2 and the previous display that, with probability at least

1−Cn−1/2c1/2

n(s, p),

2≤kukn/|u|Σ2≤3

2for all u∈Sp−1satisfying 1 ≤ |u|0≤s(B.15)

and hence, bγs≤3γswhenever nsatisﬁes n&K4

1cn(s, p).

Assembling (B.1), (B.14) and (B.15) completes the proof of Theorem 3.2.

48 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

APPENDIX C: DISCUSSION ON THE MOMENT ASSUMPTIONS

As pointed out in Remark 3.3, the sub-exponential rate, i.e. log pnc

with some c∈(0,1), requires a sub-Gaussian condition on the sampling

distribution. In the following, we will discuss the main steps on how our

analysis can be carried over under ﬁnite moment conditions, at the cost of

imposing more stringent constraints on the dimension pas a function of the

sample size n.

Note that, inequality (A.15) in the proof of Lemma 7.5 holds with B1–B3

well-deﬁned as long as the fourth moments of all coordinates of εXare ﬁnite.

From the proof of Lemma 7.6 we see that the main diﬃculty comes from

bounding

Ln=Ln(s, p) = sup

α∈V

√n

i=1 hαΣ, εiXii

with αΣ=α/|Σ1/2α|2and

sup

α∈V |αTb

Σα−αTΣα|,

where V=V(s, p) = {α∈Sp−1:|α|0=s}for 1 ≤s≤p. Without loss of

generality, we let b

Σ=b

Σn=n−1Pn

i=1 XiXT

i, where X1,...,Xnare i.i.d.

copies of a random vector X∈Rp.

Deviation bounds for the above two terms are given in the following lem-

mas. Comparing (C.1) and (C.3) with (A.32) and (7.3), respectively, we see

that the consistency of normal approximation requires signiﬁcantly more

stringent condition on the dimension punder ﬁnite fourth moment assump-

tions. In this case, the convergence in Kolmogorov distance

sup

t≥0P√nb

Rn(s, p)≤t−PR∗(s, p)≤t

holds when slog(p) = o(log n) as n→ ∞. Because our main focus is on

characterizing spurious discoveries from variable selection methods for high-

dimensional data with low-dimensional structure, the above result becomes

less instructive and is not applicable to the statistical problems considered

in this paper. Nonetheless, the study of distributional approximation for

heavy-tailed data has its own interest and is also our ongoing work [Sun,

Zhou and Fan (2016)].

Condition C.1.The random variable εsatisﬁes Eε= 0,Eε2= 1 and

R0=Eε4<∞. There exists a random vector Usuch that X=Σ1/2U,

E(U) = 0,E(UUT) = Ipand R1= supu∈RpEhu,Ui4<∞.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 49

Lemma C.1.Assume that Condition C.1 holds. Then, for any δ∈(0,1),

Ln≤Cmax{(R0R1)1/4n−1/4,1}(5eγsp/s)s/4δ−1/4

(C.1)

with probability greater than 1−δ, where C > 0is an absolute constant.

Proof of Lemma C.1.As before, deﬁne Wn=n−1/2Pn

i=1 yiwith

yi=εiXifor i= 1, . . . , n. By (A.14), for any ∈(0, γ−1

s), there exists

a ﬁnite set N⊆ V such that |N| ≤ p

s(1 + 2/)sand

Ln≤(1 −γs)−1max

α∈N|hαΣ,Wni|.(C.2)

For every α∈ Nand t > 0, by Markov’s inequality and the Rosenthal

inequality we have

P(|hαΣ,Wni| > t)

≤t−4EhαΣ,Wni4

t4n2n

i=1

EhαΣ,yii4+n

i=1

EhαΣ,yii22

.R0R1t−4n−1+t−4.

This, together with (C.2) with = (2γs)−1and the union bound implies that

for any δ∈(0,1), Ln.max{(R0R1)1/4n−1/4,1}δ−1/4with probability at

least 1 −(5eγsp/s)sδ. By a simple algebra, we obtain (C.1) as required.

Lemma C.2.Assume that Condition C.1 holds. Then, there exists some

absolute constants C > 0such that, for any δ∈(0,1),

sup

α∈V |αTb

Σα−αTΣα| ≤ C φmax(s)pR1s1/4r(8ep/s)s

δn

(C.3)

with probability greater than 1−δ, where φmax(s)is the s-sparse maximal

eigenvalue of Σ.

Proof of Lemma C.2.For every α∈ V, deﬁne

Q(α) = αTb

Σα=1

i=1

(αTXi)2and Q(α) = αTΣα.

Applying Chebyshev’s inequality to the quadratic form b

Q(α), we obtain

that for every δ > 0,

P|b

Q(α)−Q(α)| ≥ rVar(hα,Xi2)

δn ≤δ.(C.4)

50 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

In view of Proposition 6.2 in Catoni (2012), this upper bound is tight un-

der the ﬁnite fourth moment condition. Moreover, for any ∈(0,1/2),

it follows from Lemma 2 in the supplement to Wang, Berthet and Sam-

worth (2016) that there exists N⊆ V with cardinality at most π(1 −

2/16)−(s−1)/2√sp

s(2/)s−1such that

sup

α∈V |b

Q(α)−Q(α)| ≤ (1 −2)−1max

α∈N|b

Q(α)−Q(α)|.(C.5)

Together, (C.4) and (C.5) with = 1/4 yield

Psup

α∈V |b

Q(α)−Q(α)| ≥ rsupα∈V Var(hα,Xi2)

δn .√s(8ep/s)sδ.

Combining this with the fact that

hα,Xi2=hΣ1/2α,Ui2=αTΣα·Σ1/2α

|Σ1/2α|2

,U2

proves (C.3).

APPENDIX D: PROOF OF THEOREM 4.1

Without loss of generality, we assume that σ2=E(ε2

i) = 1 and s≤n≤

p≤en. The dependence of b

Roracle

non pwill be assumed without displaying.

Let bεi=bεoracle

i,b

β= (b

βT

1,b

βT

2)T=b

βoracle and δ= (δT

1,δT

2)T=b

β−β∗∈

Rpwith δ1=b

β1−β1∈Rsand δ2=b

β2−β2=0. As in the proof of

Theorem 3.1, we ﬁrst consider the following standardized version of b

Roracle

n= max

j∈[p]n−1

i=1 bεiXij,(D.1)

Recall that δ= (δT

1,δT

2)Twith δ2=0. For every j∈[p], from the identity

1X1δ1=XT

1εwe derive that

(D.2) Σ11δ1=n−1XT

1ε+b1with b1:= −n−1XT

1X1−Σ11δ1.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 51

Together with (4.4) and some simple algebra, this implies

n−1

i=1 bεiXij

=n−1ej(p)TXTε−XTXδ

=n−1

i=1

ej(p)TPXiεi

−ej(p)T0s×1

Σ21Σ−1

11 b1+n−1XT

2X1−Σ21δ1,(D.3)

where ej(p) = (0,...,0,1,0,...,0)Tis the unit vector in Rpwith 1 on the

jth position and

P=0s×s0s×d

−Σ21Σ−1

11 Id∈Rp×p.(D.4)

In view of (D.3), we deﬁne

(D.5) e

Rn= max

j∈[p]n−1

i=1

ej(p)TPXiεi.

Together, (D.1), (D.2), (D.3) and (D.5) imply

Roracle

n−e

Rn

≤max

j∈[d]ej(d)TΣ21 Σ−1/2

11 n−1Σ−1/2

11 XT

1X1Σ−1/2

11 −IsΣ1/2

11 δ1

+ max

j∈[p]\[s]n−1

i=1 XijXT

i,1Σ−1/2

11 −ej−s(d)TΣ21Σ−1/2

11 Σ1/2

11 δ1

≤max

j∈[d]Σ−1/2

11 Σ12 ej(d)2

n−1Σ−1/2

11 XT

1X1Σ−1/2

11 −Is

Σ1/2

11 δ12

+√sΣ1/2

11 δ12max

j∈[p]\[s]

k∈[s]n−1

i=1

(Id −E)Xij ek(s)TΣ−1/2

11 Xi,1

:= Q1+Q2,(D.6)

where (Id −E)Y:= Y−EYfor any random variable Y.

With the above preparations, the rest of the proof involves three steps:

First, we prove the Gaussian approximation of √ne

Rnby the Gaussian

maximum e

R∗:= |e

Z|∞, where e

=N(0,Σ22.1). Second, we prove that

52 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

√n(Q1+Q2) is negligible with high probability and that √nb

Roracle

nand

√ne

Rnare close. Finally, we apply an anti-concentration argument to prove

the convergence in the Kolmogorov distance.

Step 1: Gaussian approximation. First we prove that, under Condi-

tion 4.1 in the main text, there exists a random variable e

T∗d

R∗such that,

for every δ∈(0, K0K1],

P√ne

Rn−e

T∗≥16δ.(K0K1)3(log p)2

δ3√n+ (K0K1)4(log p)5

δ4n.(D.7)

By the deﬁnition of Pin (D.4), we have

√ne

Rn= max

j∈[p]\[s]n−1/2

i=1

ej(p)TPXiεi,

where [p]\[s] = {s+ 1, . . . , p}. In addition, write Xi= (XT

i,1,XT

i,2)Twith

Xi,1∈Rs,Xi,2∈Rdand deﬁne e

yi=εie

Xi, where e

Xi=Xi,2−Σ21Σ−1

11 Xi,1

are such that E(e

Xi) = 0and E(e

Xie

i) = Σ22.1. In this notation, we can

rewrite √ne

Rnas

(D.8) √ne

Rn= max

j∈[d]n−1/2

i=1

ej(d)Te

yi.

Next, we use the coupling inequality (A.15) below with d=p−sto the

random vectors V1,...,Vnwhich, on this occasion, are deﬁned by Vi=

(Vi1, . . . , Vid)Twith Vij =ej(d)Te

yi. Since E(εiXi) = 0 and E(ε2

i|Xi) = 1, we

have E(Vi) = 0and E(ViVT

i) = Σ22.1. Then there exists a random variable

T∗d

=|e

Z|∞such that, for every δ > 0,

P√ne

Rn−e

T∗≥16δ

.B1

log p

δ2n+B2

(log p)2

δ3n3/2+B3

(log p)3

δ4n2+log n

n.(D.9)

In addition, note that the random vectors Vi= (Vi1, . . . , Vid)Tare such that

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 53

E(V2

ij) = eσj j ≤1 and

max

j∈[d]

ej(d)Te

yi

ψ1

≤2K0max

j∈[d]

ej(d)Te

Xi

ψ2

≤2K0max

j∈[p]\[s]

ej(p)TPΣ1/2Ui

ψ2

≤2K0K1max

j∈[p]\[s]ej(p)TPΣPTej(d)1/2

= 2K0K1max

j∈[d]eσ1/2

jj ≤2K0K1.(D.10)

Consequently, similar arguments to those leading to (A.16), (A.17) and

(A.18) in Appendix A.4 can be used to derive that

B1.(K0K1)2pnlog p+ (log p)3,

B2.(K0K1)3n+ (log p)4, B3.(K0K1)4n(log p)2.

Plugging the above bounds for B1–B3into (D.9) proves (D.7).

Step 2. First we prove that √nQ1and √nQ2are negligible with high prob-

ability, starting with √nQ1. Since Σ22.1=Σ22 −Σ21 Σ−1

11 Σ12 is positive

deﬁnite,

(D.11) max

j∈[d]Σ−1/2

11 Σ12 ej(d)2≤max

j∈[d]ej(d)TΣ22 ej(d)1/2= 1.

Again, from the identity XT

1X1δ1=XT

1εwe ﬁnd that

δT

1(n−1XT

1X1)δ1

=δT

1Σ1/2

11 n−1Σ−1/2

11 XT

1ε(D.12)

≤Σ1/2

11 δ12·n−1Σ−1/2

11 XT

1ε2

≤Σ1/2

11 δ12·√sn−1Σ−1/2

11 XT

1ε∞.

To bound the left-hand side of (D.12) from below, note that

δT

1(n−1XT

1X1)δ1

=δT

1Σ1/2

11 n−1Σ−1/2

11 XT

1X1Σ−1/2

11 −IsΣ1/2

11 δ1+δT

1Σ11δ1

≥1−

n−1Σ−1/2

11 XT

1X1Σ−1/2

11 −Is

δT

1Σ11δ1.(D.13)

54 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Recall that Xi= (XT

i,1,XT

i,2)T,X1Σ−1/2

11 = (Σ−1/2

11 X1,1,...,Σ−1/2

11 Xn,1)T∈

Rn×s. Under Condition 4.1,

sup

α∈Ss−1

αTΣ−1/2

11 Xi,1

ψ2= sup

α∈Ss−1

αTΣ−1/2

11 (Is,0)Σ1/2U

ψ2

≤K1sup

α∈Ss−1Σ1/2(Is,0)TΣ−1/2

11 α2

=K1sup

α∈Ss−1|α|2=K1,

which, together with Theorem 5.39 in Vershynin (2012) yields that, for every

t≥0,

(D.14) 

n−1Σ−1/2

11 XT

1X1Σ−1/2

11 −Is

.max(δ, δ2)

holds with probability at least 1−2 exp(−cBt2), where δ=K2

1n−1/2(√s+t).

By (D.12), (D.13) and taking t=c−1/2

Bplog(2n) in (D.14), we have with

probability at least 1 −n−1,

(D.15) 1

2δT

1Σ11δ1≤n−1XT

1X1δ1≤Σ1/2

11 δ12·√sn−1Σ−1/2

11 XT

1ε∞

whenever the sample size nsatisﬁes n&K4

1(s+ log n).

To bound the right-hand side of (D.15), we deﬁne ξij =ej(s)TΣ−1/2

11 Xi,1εi

such that |n−1Σ−1/2

11 XT

1ε|∞= maxj∈[s]|n−1Pn

i=1 ξij|. Under Condition (4.1),

we have E(ξij ) = 0, E(ξ2

ij) = 1 and

kξijkψ1≤2kεkψ2

ej(s)TΣ−1/2

11 (Is,0)Σ1/2U

ψ2

≤2K0K1ej(s)TΣ−1/2

11 (Is,0)Σ1/22= 2K0K1.

Using the union bound and inequality (7.1) in the main text implies that,

for every t > 0,

PnΣ−1/2

11 XT

1ε∞>2K0K1max √nt, to≤2sexp(−cBt).(D.16)

Taking respectively t=c−1/2

Bplog(2n) and t=c−1

Blog(2sn) in (D.14) and

(D.16) yields, with probability at least 1 −2n−1,



n−1Σ−1/2

11 XT

1X1Σ−1/2

11 −Is

Σ1/2

11 δ12.K0K3

1n−1slog n

whenever n&K4

1(s+ log n). Combining this with (D.11), we have with the

same probability,

(D.17) √nQ1.K0K3

1n−1/2slog n

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 55

whenever n&K4

1(s+ log n).

Turning to Q2, we deﬁne ξi,jk =Xij ek(s)TΣ−1/2

11 Xi,1such that

Q21 := max

j∈[p]\[s]

k∈[s]n−1

i=1

(Id −E)Xij ek(s)TΣ−1/2

11 Xi,1

= max

j∈[p]\[s]

k∈[s]n−1

i=1 ξi,jk −Eξi,jk

and Q2≤√s|Σ1/2

11 δ1|2Q21. To bound Q21, note that kξi,jk −Eξi,jk kψ1≤

2kξi,jk kψ1and



ξi,jk 

ψ1≤2kXijkψ2

ek(s)TΣ−1/2

11 Xi,1

ψ2

= 2

ej(p)TΣ1/2U

ψ2

ek(s)TΣ−1/2

11 (Is,0)Σ1/2U

ψ2

≤2K2

1Σ1/2ej(p)2·Σ1/2(Is,0)TΣ−1/2

11 ek(s)2= 2K2

Then using inequality (7.1) and the union bound again, we obtain that for

every t > 0,

PQ21 ≥4K2

1max pt/n, t/n≤2(p−s) exp(−cBt).

Taking t=c−1

Blog(2pn), we conclude from the bound on |Σ1/2

11 δ1|2estab-

lished earlier that, with probability at least 1 −3n−1,

(D.18) √nQ2.K0K3

1n−1/2slog p

whenever n&K4

1(s+ log n).

Putting (D.6), (D.17) and (D.18) together implies that, with probability

at least 1 −3n−1,

√nRoracle

n−√ne

Rn.K0K3

1n−1/2slog p(D.19)

whenever n&K4

1(s+ log n).

Next, we prove that √nb

Roracle

nand √ne

Rnare close with high probability.

To this end, set b

ε= (bε1,...,bεn)Tand deﬁne bσ2=n−1Pn

i=1(bεi−¯ε)2,bσ2

n−1Pn

i=1(Xij −¯

Xj)2, where ¯ε=eT

εand en= (1/n, . . . , 1/n)T∈Rn. In

this notation, we have

√nb

Roracle

n=bσ−1max

j∈[p]bσ−1

jn−1/2

i=1 bεiXij −√n¯ε¯

Xj.

56 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Combined with (D.1), this implies

√nb

Roracle

n−√nRoracle

n

≤max

j∈[p](bσbσj)−1−1·√nRoracle +√nbσ−1|¯ε|max

j∈[p]bσ−1

j|¯

Xj|.(D.20)

In view of (D.19), it suﬃces to show that the right-hand side of (D.20)

is negligible with high probability. The following lemma provides deviation

inequalities for the variance estimators bσ2and bσ2

jas well as the sample

means |¯ε|and |¯

Xj|. The proof is deferred to Section H.

Lemma D.1.Assume that Condition 4.1 in Fan, Shao and Zhou (2015)

holds. Then, with probability at least 1−Cn−1,

max

j∈[p]|¯

Xj|.K1rlog p

n,max

j∈[p]bσ2

j−1.K2

1rlog p

n+log p

n

(D.21)

and

|¯ε|.K0K2

slog n

n,bσ2−1.K2

0rlog n

n+ (K0K1)2slog n

n,(D.22)

provided that n&K4

1slog n.

In addition, for √ne

Rnin (D.8), it follows from the union bound, in-

equality (7.1) in the main text and (D.10) that, with probability at least

1−2dexp(−cBt), √ne

Rn.K0K1max(√t, n−1/2t). This implies by taking

t=c−1

Blog(2pn) that, with probability at least 1 −n−1,

√ne

Rn.K0K1plog p.(D.23)

Combining (D.19), (D.20) and (D.23), we conclude from Lemma D.1 that,

with probability at least 1 −Cn−1,

√nb

Roracle −√ne

Rn.(K0∨K1)2K0K1n−1/2slog p,(D.24)

provided n&(K0∨K1)4slog p.

Step 3. From (D.7) with δ= (K0K1)3/4min{1, n−1/8(log p)3/8}and (D.24),

we obtain that, with probability at least 1 −C(K0K1)3/4n−1/8(log p)7/8,

√nb

Roracle

n−e

T∗.(K0K1)3/4(log p)3/8

n1/8+ (K0∨K1)2K0K1

slog p

√n,

(D.25)

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 57

where e

T∗d

R∗. For e

R∗=|e

Z|∞, it follows from Theorem 3, (ii) in Cher-

nozhukov, Chetverikov and Kato (2015) and the fact maxj∈[d]eσjj ≤1 that,

for every  > 0,

sup

t≥0

Pe

R∗−t≤≤e

Cplog d+plog(1/),(D.26)

where e

C > 0 depends only on eσmin = minj∈[d]eσjj , which under Condi-

tion 4.2, is bounded away from zero.

Finally, combining (D.25) and (D.26) leads to

sup

t≥0P√nb

Roracle

n≤t−Pe

R∗≤t

.(K0K1)3/4n−1/8(log p)7/8+ (K0∨K1)2K0K1n−1/2slog p.

The conclusion of the theorem follows immediately.

APPENDIX E: PROOF OF THEOREM 4.2

In view of Theorem 4.1, we only need to prove the strong oracle property

of b

βlla, i.e.

(E.1) Pb

βlla =b

βoracle→1 as n→ ∞.

Together, (E.1) and (4.6) prove (4.9).

To prove (E.1), deﬁne events

A1=max

1≤j≤pn−1

i=1

ij ≤2,A2=nκ(s, 3,Sn)≥1

2κ(s, 3,Σ)o,(E.2)

where Sn=n−1XTX. Given {Xi}n

i=1 and on the event A1∩A2, applying The-

orem 1 and Corollary 3 in Fan, Xue and Zou (2014) gives, with conditional

probability at least 1 −2pexp(−c0nλ2

lasso/K2

0)−2(p−s) exp(−c1nλ2/K2

over {εi}n

i=1, the computed estimator b

βlla equals the oracle estimator b

βoracle,

provided λ≥8√s λlasso

κ(s,3,Σ), where c0, c1>0 are absolute constants. Taking into

account the randomness of {Xi}n

i=1, we obtain that

Pb

βlla 6=b

βoracle

≤2pexp(−c0nλ2

lasso/K2

0) + 2(p−s) exp(−c1nλ2/K2

0) + P(Ac

1) + P(Ac

2).

It remains to show that the events A1and A2in (E.2) hold with over-

whelming probability. Using the union bound and the one-sided version of

58 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

inequality (7.1), we ﬁnd that the probability of the complementary event

1satisﬁes P(Ac

1)≤pexp(−c2n/K4

1). Under Condition 4.1,Xi=Σ1/2Ui,

where U1,...,Unare i.i.d. Rp-valued isotropic random vectors. Then it

follows from Theorem 6 and Remark 15 in Rudelson and Zhou (2013) by

taking δ= 1 −√2/2, s0=s,k0= 3, q=p,α=K1,A=Σ1/2and

Ψ = (U1,...,Un)T∈Rn×pthere that, P(Ac

2)≤2 exp(−c3n/K4

1) when-

ever the sample size nsatisﬁes n&K4

1slog p

κ(s,3+,Σ). Here, c2, c3>0 are absolute

constants. The proof of Theorem 4.2 is then complete.

APPENDIX F: PROOF OF PROPOSITION 3.1

F.1. Preliminaries. First, we introduce basic notation and deﬁnitions

that will be used to prove Proposition 3.1.

F.1.1. m-generated convex set. For any convex set A⊆Rp, its support

function is deﬁned as v7→ SA(v) := sup{wTv:w∈A}for v∈Sp−1, such

that Acan be written as

A=\

v∈Sp−1w∈Rp:wTv≤SA(v).

Following Chernozhukov, Chetverikov and Kato (2014b), we say that Ais

m-generated if it is generated by intersections of mhalf-spaces; that is, there

exists a subset V(A)⊆Sp−1consisting of munit vectors that are outward

normal to the faces of Asuch that

A=\

v∈V(A)w∈Rp:wTv≤SA(v).

Moreover, for m≥1 and  > 0, we say that a convex set Aadmits an

approximation with precision by an m-generated convex set Amif Am⊆

A⊆Am,, where Am, := ∩v∈V(Am){w∈Rp:wTv≤SAm(v) + }.

F.1.2. Sparsely convex set. In this section, we consider a particular class

of convex sets that can be approximated by m-generated convex sets with a

pre-speciﬁed precision for some ﬁnite m≥1.

Definition F.1 (Sparsely convex sets).Let 1≤s≤pand Q≥1be two

integers. We say that A⊆Rpis an (s, Q)-sparsely convex set if A=∩Q

q=1Aq,

where for each q,Aqis a convex set and is such that the map w7→ I(w∈Aq)

depends at most on scomponents of w= (w1, . . . , wp)T∈Rp. We refer to

A=∩Q

q=1Aqas a sparse representation of A.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 59

The class of sparsely convex sets can be regarded as a generalization of

the class of the rectangles. We refer to Chernozhukov, Chetverikov and Kato

(2014b) for a detailed introduction and more concrete examples. In particu-

lar, the following result which is Lemma D.1 there shows that under suitable

conditions, sparsely convex sets can be approximated by m-generated convex

sets with pre-speciﬁed precisions.

Lemma F.1.Assume that A is an (s, Q)-sparsely convex set satisfying

(i). 0∈A, (ii). supw∈A|w|2≤Rfor some R > 0and (iii). A=∩Q

q=1Aq,

where for each q,−A1⊆µAqfor some µ≥1. Then for every γ > e/8,

there exists 0=0(γ)>0such that for any 0<  < 0,Aadmits an

approximation with precision R by an m-generated convex set Amsatisfying

that (i). |v|0≤sfor all v∈ V(Am), and (ii). m≤Qγqµ+1

log 1

s2

F.1.3. Central limit theorem for simple convex sets. Let X1,...,Xnbe

i.i.d. p-dimensional random vectors with mean zero and covariance matrix Σ,

and let Zbe a p-dimensional centered Gaussian random vector with the same

covariance matrix. Assume that diag(Σ) = Ip. Write W=n−1/2Pn

i=1 Xi.

For a given class Aof Borel sets in Rp, the problem of bounding the quantity

ρn(A) = supA∈A |P(W∈A)−P(Z∈A)|, which characterizes the rate of

convergence to normality with respect to A, is of long-standing interest.

In this section, we focus on a particular class of convex sets for which a

Berry-Esseen theorem can be established in high dimensions.

For integers 1 ≤s≤p,m≥1 and for δ≥0, we denote by Asc(s, m, δ)

the class of convex sets in Rpsatisfying that, every A∈ Asc(s, m, δ) admits

an approximation with precision δby an m-generated convex set Amwhich

can be chosen to satisfy |v|0≤sfor all v∈Am. We refer to Asc(s, m, δ )

as a class of simple convex sets. The following Berry-Esseen-type result is

a modiﬁcation of Proposition 3.2 in Chernozhukov, Chetverikov and Kato

(2014b).

Lemma F.2.There exists some integer 1≤s≤psuch that

K= sup

u∈Sp−1:|u|0≤skuTX1kψ1<∞and σ2

min = inf

u∈Sp−1:|u|0≤s

E(uTX1)2>0.

Then, there exists an absolute constant C > 0such that for any m≥1and

δ > 0,

sup

A∈Asc(s,m,δ)|P(W∈A)−P(Z∈A)|

≤Cσ−1

minhKn−1/6{log(mn)}7/6+δplog mi.

60 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

F.2. Proof of the proposition. First, we deﬁne the following stan-

dardized counterparts of √nb

Rn(k, p) for k= 1, . . . , s:

Ln(k, p) = max

u∈Sp−1:|u|0=k|uTW|= max

S⊆[p]:|S|=k|WS|2,

where W=n−1/2Pn

i=1 yiwith yi=εiXifor i= 1, . . . , n.

The following lemma shows that, after properly normalized, the joint

distribution of {Ln(k, p)}s

k=1 can be consistently estimated by that of the

top sorder statistics of i.i.d. chi-square random variables with 1 degree of

freedom. Recall that Z= (Z1, . . . , Zp)Td

=N(0,Ip), and Z2

(1) ≤Z2

(2) ≤ ··· ≤

(p)denote the order statistics of {Z2

1, . . . , Z2

p}.

Lemma F.3.Assume that Conditions 2.1 and 2.2 in the main text hold

with Σ=Ip. Then there is an absolute constant C > 0such that

sup

0<t1<···<tk≤2Ps

k=1 n−1/2Ln(k, p)≤tk

−Ps

k=1 n−1/2R∗(k, p)≤tk≤C(K0K1)1/3{s2log(pn)}7/6

n1/6,

where R∗(k, p)2= maxu∈Sp−1:|u|0=k(uTZ)2=Pk

ν=1 Z2

(ν).

Further, deﬁne

L=√nb

Rn(1, p),..., b

Rn(s, p)T,L=Ln(1, p), . . . , Ln(s, p)T.

Then it is easy to see that |b

L−L|∞≤max1≤k≤s|√nb

Rn(k, p)−Ln(k, p)|.

Taking V:= ∪s

k=1V(k, p) with V(k, p) = {x7→ uTx:u∈Sp−1,|u|0=k}as

in Lemma 7.2, the same conclusions there hold by a similar argument. Con-

sequently, it follows from a modiﬁcation of Lemma 7.6 that, with probability

greater than 1 −C1n−1/2{cn(s, p)}1/2,

(F.1) |b

L−L|∞≤C2(K0∨K1)2K0K1n−1/2cn(s, p)

whenever n≥C3(K0∨K1)4cn(s, p), where cn(s, p) := slog(ep/s)∨log n.

Together, (F.1) and Lemma F.3 imply that for any 0 < t1<··· < ts<1

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 61

and all suﬃciently large n,

Ps

k=1 b

Rn(k, p)≤tk

≤Ps

k=1 Ln(k, p)≤√n(tk+n)+C1n−1/2{cn(s, p)}1/2

≤Ps

k=1 n−1/2R∗(k, p)≤tk+n+C(K0K1)1/3n−1/6{s2log(pn)}7/6

≤Ps

k=1 n−1/2R∗(k, p)≤tk

+P√nts< R∗(s, p)≤√n(ts+n)+C(K0K1)1/3n−1/6{s2log(pn)}7/6

≤Ps

k=1 n−1/2Rk≤tk

+Cnpns log(ep/s)+(K0K1)1/3n−1/6{s2log(pn)}7/6,

where n=n(s, p) = C2(K0∨K1)2K0K1n−1cn(s, p)≤1 for all suﬃciently

large n. A similar argument leads to the reverse inequality, and hence com-

pletes the proof.

F.3. Proof of Lemma F.2.This proof is similar to that of Proposi-

tion 3.2 in Chernozhukov, Chetverikov and Kato (2014b) with slight modi-

ﬁcation. We reproduce them here for the sake of readability.

For every A∈ Asc(s, m, δ), let Ambe the approximating m-generated

convex set of Asuch that Am⊆A⊆Am,δ. Put

ρ= max |P(W∈Am)−P(Z∈Am)|,|P(W∈Am,δ )−P(Z∈Am,δ )|.

Applying Lemma A.1 in Chernozhukov, Chetverikov and Kato (2014b) and

Theorem 20 in Klivans, O’Donnell and Servedio (2008) to the m-dimensional

Gaussian random vector (vTZ)v∈V(Am)implies that

|P(W∈A)−P(Z∈A)| ≤ σ−1

minp2 log m+ 2δ+ρ.(F.2)

Recall that K= supu∈Sp−1:|u|0≤skuTXikψ1<∞. Then, for every q≥3

and v∈ V(Am) with |v|0≤s,

n−1

i=1

E|vTXi|q≤n−1

i=1

(qkvTXikψ1)q≤(qK)q.

62 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Consequently, it follows from Proposition 2.1 in Chernozhukov, Chetverikov

and Kato (2014b) that

ρ.σ−1

minK n−1/6{log(mn)}7/6.(F.3)

Together, (F.2) and (F.3) complete the proof of the lemma.

F.4. Proof of Lemma F.3.For any 0 < t1<·· · < ts≤2, we have

PLn(1, p)≤√nt1, Ln(2, p)≤√nt2,···, Ln(s, p)≤√nts

=PW∈

k=1

Ak(wk),

where wk=nt2

kfor k= 1, . . . , s and for t≥0,

Ak(t) := w∈Rp:|wS|2

2≤tfor all S⊆[p] with |S|=k.

Put A(t) = ∩s

k=1Ak(wk), where t= (t1, . . . , ts). For every 1 ≤k≤s, let

{Sk`}(p

`=1 be all the subsets of [p] with cardinality k. In this notation, we can

further write the set A(t) as

A(t) =

k=1

`=1

Ak` =

k=1

`=1 w∈Rp:|wSk` |2

2≤wk.

It is easy to see that the indicator function w= (w1, . . . , wp)T∈Rp7→

I(w∈Ak`) depends only on k(≤s) components of w. By Deﬁnition F.1,

A(t) is an (s, (ep/s)s)-sparsely convex set. Then it follows from Lemma F.1

with R= 2(pn)1/2and γ=µ= 1 that there exists some constant 0>0

such that for every ∈(0, 0) and 0 < t1<··· < ts<1, the set A(t) admits

an approximation with precision 2(pn)1/2by an m-generated convex set

Am, where

m≤ep

ssr2

log 1

s2

In particular, taking = (pn)−1yields, for any t= (t1, . . . , ts) with 0 < t1<

t2<··· < ts≤2, A(t)∈ Ascs, (ep/s)s{√2pn log(pn)}s2,2(pn)−1/2. This,

together with Lemma F.2 and the inequality kuTyikψ1≤2kεikψ2kuTXikψ2≤

2K0K1that holds for all u∈Sp−1completes the proof of the lemma.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 63

APPENDIX G: PROOF OF PROPOSITION 3.2

Observe that Z2

1, . . . , Z2

pare i.i.d. chi-square random variables with 1 de-

gree of freedom. For t∈Rﬁxed, let tp=ap+twith ap= 2 log p−log(log p)

such that as p→ ∞,

PZ2

(p)≤tp=1−P(Z2

1> tp)p→exp −1

√πe−t/2.

In other words, Z2

(p)−apconverges weakly to a Gumbel distribution with

a cumulative distribution function given by G(t) = exp(−π−1/2e−t/2) for

t∈R. Consequently, for every s≥2 ﬁxed, the s-dimensional vector

Z2

(p)−ap, . . . , Z2

(p−s+1) −ap

has a limiting distribution with joint density function given by [David and

Nagaraja (2003)]

gs(t1, . . . , ts) = G(ts)

j=1

g(ti)

G(ti), t1> t2>··· > ts,

where g(t) = G0(t) = e−t/2

2√πG(t) for t∈R, and it is easy to verify that

gs(t1, . . . , ts) = 1

2√πs−1

exp −1

s−1

j=1

tjg(ts), t1> t2>··· > ts.

Consequently, as p→ ∞,

PZ2

(p)+· ·· +Z2

(p−s+1) −sap≤t

→Z···Zt1+···ts≤t,t1>···>ts

gs(t1, . . . , ts)dt1··· dts

=1

2√πs−1Zt/s

−∞ Z···Zt1+···+ts−1≤t−ts,t1>···>ts−1>ts

s−1

j=1

e−tj/2dtjg(ts)dts

=1

√πs−1Zt/s

−∞ Z···Zt1+···+ts−1≤(t−ts)/2,t1>···>ts−1>ts/2

s−1

j=1

e−tjdtjg(ts)dts

=1

√πs−1Zt/s

−∞

dtse−(s−1)ts/2g(ts)

×Z···Zu1+···+us−1≤(t−sts)/2, u1>···>us−1>0

e−u1−···−us−1du1··· dus−1.

(G.1)

64 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

Now, let E1, . . . , Es−1be i.i.d. standard exponential distributed random

variables and let E(1) ≥E(2) ≥ ··· ≥ E(s−1) be the corresponding order

statistics. It is known that the joint density function of (E(1), . . . , E(s−1) ) is

(s−1)!e−t1−···−ts−1,t1> t2>··· > ts−1≥0. Therefore, the last multiple

integral on the right side of (G.1) is equal to

(s−1)!PE(1) +· ·· +E(s−1) ≤(t−sts)/2

(s−1)!PE1+· ·· +Es−1≤(t−sts)/2

(s−1)!Γ(s−1) Z(t−sts)/2

us−2e−udu,

where we used the fact that E1+·· · +Es−1d

= Gamma(s−1,1). Putting

the above calculations together yields (3.7).

To prove (3.8), observe that for any a > 0 and positive integer `,

Γ(`)Za

u`−1e−udu = 1 −

`−1

j=0

j!e−a.

Hence, for s≥2,

Γ(s−1) Zt/s

−∞ Z(t−sv)/2

us−2e−udue−(s−1)v/2g(v)dv

=Zt/s

−∞

e−(s−1)v/2g(v)dv

−

s−2

j=0

j!2jZt/s

−∞

(t−sv)je−(t−sv)/2−(s−1)v/2g(v)dv

=Zt/s

−∞

e−(s−1)v/2g(v)dv −e−t/2

s−2

j=0

j!2jZt/s

−∞

(t−sv)jev/2g(v)dv.(G.2)

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 65

Further, using integration by parts repeatedly gives

Zt/s

−∞

e−(s−1)v/2g(v)dv =Zt/s

−∞

e−(s−1)v/2dG(v)

=G(t/s)e−(s−1)t/(2s)+ (s−1)√πZt/s

−∞

e−(s−2)v/2dG(v)

=G(t/s)ne−(s−1)t/(2s)+ (s−1)√πe−(s−2)t/(2s)o

+ (s−1)(s−2)πZt/s

−∞

e−(s−3)v/2dG(v)

=···

=G(t/s)π(s−1)/2(s−1)! + e−t/2+t/(2s)

(G.3)

+e−t/2

s−2

j=1

πj/2e(j+1)t/(2s)

`=1

(s−`).

The ﬁrst summand of the last term on the right-hand side of (G.2) reads to

(G.4) Zt/s

−∞

ev/2g(v)dv =et/(2s)G(t/s)−√πZt/s

−∞

evg(v)dv.

Assembling (G.2)–(G.4) completes the proof of Proposition 3.2.

APPENDIX H: PROOF OF LEMMA C.1

We continue to adopt the notation in the proof of Theorem 4.1. To prove

(D.21), consider the inequality |bσ2

j−1|≤|n−1Pn

i=1 X2

ij −1|+¯

j. Anal-

ogously to (7.4), for every t1, t2>0 we have, with probability at least

1−2pexp(−cHt1)−2pexp(−cBt2), maxj∈[p]|¯

Xj| ≤ K1n−1/2√t1and

max

j∈[p]bσ2

j−1≤K2

1n−1t1+ 4K2

1max n−1/2√t2, n−1t2.

In particular, taking t1=c−1

Hlog(2pn) and t2=c−1

Blog(2pn) proves (D.21).

Next we prove (D.22). Recall that Y=X1β1+εand b

ε= (bε1,...,bεn)T=

Y−X1b

β1. Therefore, we have

i=1 bε2

i=b

εTb

ε=εTε−2εTX1δ1+|X1δ1|2

¯ε=eT

ε=eT

nε−n−1

i=1

i,1δ1.

66 J. FAN, Q.-M. SHAO AND W.-X. ZHOU

This and (D.16) yield |b

εTb

ε−εTε| ≤ 3√s|Σ−1/2

11 XT

1ε|∞·|Σ1/2

11 δ1|2and |eT

n(b

ε−

ε)| ≤ |Σ1/2

11 δ1|2·√s|n−1Pn

i=1 Σ−1/2

11 Xi,1|∞. Applying the union bound and

inequality (7.2) we obtain that, for every t > 0, |n−1/2Pn

i=1 Σ−1/2

11 Xi,1|∞≤

K1tholds with probability at least 1 −2sexp(−cHt2). Hence, taking t=

c−1/2

Hplog(2sn), we conclude from (D.12), (D.15) and (D.16) that, with

probability at least 1 −3n−1,

b

εTb

ε−εTε.(K0K1)2slog n, eT

n(b

ε−ε).K0K2

1n−1slog n,(H.1)

provided that n&K4

1(s+ log n).

Finally, from (H.1) and (7.4) we obtain, with probability at least 1−5n−1,

bσ2−1.K2

0rlog n

n+ (K0K1)2slog n

whenever n&K4

1slog n. This, together with (H.1) proves (D.22).

APPENDIX I: ADDITIONAL SIMULATION RESULTS

In this section, we present additional numerical results for detecting spu-

rious discoveries in the case of non-Gaussian design and noise. We con-

tinue with the setup in Section 6.3 by taking n= 120,160, p= 400 and

β∗= (1,0,−0.8,0,0.6,0,−0.4,0,...,0)T∈Rp. For r∈ {120,200,280,360},

we let x= (x1, . . . , xr)T, where x1, . . . , xrare i.i.d. random variables follow-

ing the continuous uniform distribution on [−1,1]. The rows of the design

matrix Xare sampled as i.i.d. copies from Γrx∈Rp, where Γris a p×r

matrix satisfying ΓT

rΓr=Ir. Moreover, the noise variable εfollows a stan-

dardized t-distribution with 4 degrees of freedom. We compute the empirical

SDP based on 200 simulations. The results are provided in Table 6.

Table 6

The empirical α-level Spurious Discovery Probability (ESDP) based on 200 simulations

when p= 400,n= 120,160 and α= 5%.

Trule r= 120 r= 200 r= 280 r= 360

n= 120 0.9100 0.8000 0.7200 0.5900

n= 160 0.7650 0.6100 0.3600 0.2750

REFERENCES

Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes

with applications to Markov chains. Electron. J. Probab. 13 1000–1034.

Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation

study. Ann. Inst. Henri Poincar´e Probab. Stat. 48 1148–1185.

DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS 67

Chernozhukov, V., Chetverikov, D. and Kato, K. (2014a). Gaussian approximation

of suprema of empirical processes. Ann. Statist. 42 1564–1597.

Chernozhukov, V., Chetverikov, D. and Kato, K. (2014b). Central limit theorems

and bootstrap in high dimensions. Available at arXiv:1412.3661.

Chernozhukov, V., Chetverikov, D. and Kato, K. (2015). Comparison and anti-

concentration bounds for maxima of Gaussian random vectors. Probab. Theory Relat.

Fields 162 47–70.

David, H. A. and Nagaraja, H. N. (2003). Order Statistics (3rd ed). Wiley-Interscience,

Hoboken, NJ.

de la Pe˜

na, V. H., Lai, T. L. and Shao, Q.-M. (2009). Self-Normalized Processes: Limit

Theory and Statistical Applications. Springer, Berlin.

Fan, J., Shao, Q.-M. and Zhou, W.-X. (2015). Are discoveries spurious? Distributions of

maximum spurious correlations and their applications. Available at arXiv:1502.04237.

Fan, J., Xue, L. and Zou, H. (2014). Strong oracle optimality of folded concave penalized

estimation. Ann. Statist. 42 819–849.

Klivans, A., O’Donnell, R. and Servedio, R. (2008). Learning geometric concepts

via Gaussian surface area. In Proceedings of 49th IEEE Symposium on Foundations of

Computer Science 541–550.

Mendelson, S. (2010). Empirical processes with a bounded ψ1diameter. Geom. Funct.

Anal. 20 988–1027.

Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measure-

ments. IEEE Trans. Inform. Theory 59 3434–3447.

Sun, Q., Zhou, W.-X. and Fan, J. (2016). Adaptive Huber regression: Optimality and

phase transition. Preprint.

Talagrand, M. (2005). The Generic Chaining. Springer-Verlag, Berlin.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical

Processes: With Applications to Statistics. Springer, New York.

Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices.

In Compressed Sensing: Theory and Applications (Y. Eldar and G. Kutyniok, eds.)

210–268. Cambridge University Press, Cambridge.

Wang, T., Berthet, Q. and Samworth, R. J. (2016). Statistical and computational

trade-oﬀs in estimation of sparse principal components. Ann. Statist. 44 1896–1930.

Guarding against Spurious Discoveries in High Dimensions

Article

Full-text available

Nov 2016
J MACH LEARN RES

Many data mining and statistical machine learning algorithms have been developed to select a subset of covariates to associate with a response variable. Spurious discoveries can easily arise in high-dimensional data analysis due to enormous possibilities of such selections. How can we know statistically our discoveries better than those by chance? In this paper, we define a measure of goodness of spurious �fit, which shows how good a response variable can be �fitted by an optimally selected subset of covariates under the null model, and propose a simple and effective LAMM algorithm to compute it. It coincides with the maximum spurious correlation for linear models and can be regarded as a generalized maximum spurious correlation. We derive the asymptotic distribution of such goodness of spurious �fit for generalized linear models and $L_1$ regression. Such an asymptotic distribution depends on the sample size, ambient dimension, the number of variables used in the �fit and the covariance information. It can be consistently estimated by multiplier bootstrapping and used as a benchmark to guard against spurious discoveries. It can also be applied to model selection, which considers only candidate models with goodness of fits better than those by spurious �fits. The theory and method are convincingly illustrated by simulated examples and an application to the binary outcomes from German Neuroblastoma Trials.

Testing High Dimensional Mean Under Sparsity

Article

Full-text available

Sep 2015

Xianyang Zhang

Motivated by the likelihood ratio test under the Gaussian assumption, we develop a maximum sum-of-squares test for conducting hypothesis testing on high dimensional mean vector. The proposed test which incorporates the dependence among the variables is designed to ease the computational burden and to maximize the asymptotic power in the likelihood ratio test. A simulation-based approach is developed to approximate the sampling distribution of the test statistic. The validity of the testing procedure is justified under both the null and alternative hypotheses. We further extend the main results to testing the quality of two mean vectors without imposing the equal covariance assumption. Numerical results suggest that the proposed test can be more powerful than some existing alternatives.

Asymptotic Analysis for Extreme Eigenvalues of Principal Minors of Random Matrices

Preprint

May 2019

Consider a standard white Wishart matrix with parameters $n$ and $p$. Motivated by applications in high-dimensional statistics and signal processing, we perform asymptotic analysis on the maxima and minima of the eigenvalues of all the $m \times m$ principal minors, under the asymptotic regime that $n,p,m$ go to infinity. Asymptotic results concerning extreme eigenvalues of principal minors of real Wigner matrices are also obtained. In addition, we discuss an application of the theoretical results to the construction of compressed sensing matrices, which provides insights to compressed sensing in signal processing and high dimensional linear regression in statistics.

Statistical Validity and Consistency of Big Data Analytics: A General Framework

Article

Full-text available

Mar 2018

Informatics and technological advancements have triggered generation of huge volume of data with varied complexity in its management and analysis. Big Data analytics is the practice of revealing hidden aspects of such data and making inferences from it. Although storage, retrieval and management of Big Data seem possible through efficient algorithm and system development, concern about statistical consistency remains to be addressed in view of its specific characteristics. Since Big Data does not conform to standard analytics, we need proper modification of the existing statistical theory and tools. Here we propose, with illustrations, a general statistical framework and an algorithmic principle for Big Data analytics that ensure statistical accuracy of the conclusions. The proposed framework has the potential to push forward advancement of Big Data analytics in the right direction. The partition-repetition approach proposed here is broad enough to encompass all practical data analytic problems.

Variable Selection for Highly Correlated Predictors

Article

Sep 2017

Penalty-based variable selection methods are powerful in selecting relevant covariates and estimating coefficients simultaneously. However, variable selection could fail to be consistent when covariates are highly correlated. The partial correlation approach has been adopted to solve the problem with correlated covariates. Nevertheless, the restrictive range of partial correlation is not effective for capturing signal strength for relevant covariates. In this paper, we propose a new Semi-standard PArtial Covariance (SPAC) which is able to reduce correlation effects from other predictors while incorporating the magnitude of coefficients. The proposed SPAC variable selection facilitates choosing covariates which have direct association with the response variable, via utilizing dependency among covariates. We show that the proposed method with the Lasso penalty (SPAC-Lasso) enjoys strong sign consistency in both finite-dimensional and high-dimensional settings under regularity conditions. Simulation studies and the `HapMap' gene data application show that the proposed method outperforms the traditional Lasso, adaptive Lasso, SCAD, and Peter-Clark-simple (PC-simple) methods for highly correlated predictors.

Testing Endogeneity with Possibly Invalid Instruments and High Dimensional Covariates

Article

Full-text available

Sep 2016

The Durbin-Wu-Hausman (DWH) test is a commonly used test for endogeneity in instrumental variables (IV) regression. Unfortunately, the DWH test depends, among other things, on assuming all the instruments are valid, a rarity in practice. In this paper, we show that the DWH test often has distorted size even if one IV is invalid. Also, the DWH test may have low power when many, possibly high dimensional, covariates are used to make the instruments more plausibly valid. To remedy these shortcomings, we propose a new endogeneity test which has proper size and better power when invalid instruments and high dimemsional covariates are present; in low dimensions, the new test is optimal in that its power is equivalent to the "oracle" DWH test's power that knows which instruments are valid. The paper concludes with a simulation study of the new test with invalid instruments and high dimensional covariates.

Spherical Cap Packing Asymptotics and Rank-Extreme Detection

Article

Nov 2015
IEEE T INFORM THEORY

Kai Zhang

We study the spherical cap packing problem with a probabilistic approach. Such probabilistic considerations result in an asymptotic sharp universal uniform bound on the maximal inner product between any set of unit vectors and a stochastically independent uniformly distributed unit vector. When the set of unit vectors are themselves independently uniformly distributed, we further develop the extreme value distribution limit of the maximal inner product, which characterizes its uncertainty around the bound. As applications of the above asymptotic results, we derive (1) an asymptotic sharp universal uniform bound on the maximal spurious correlation, as well as its uniform convergence in distribution when the explanatory variables are independently Gaussian distributed; and (2) an asymptotic sharp universal bound on the maximum norm of a low-rank elliptically distributed vector, as well as related limiting distributions. With these results, we develop a fast detection method for a low-rank structure in high-dimensional Gaussian data without using the spectrum information.

Sparse Sliced Inverse Regression Via Lasso

Article

Full-text available

Sep 2018
J AM STAT ASSOC

For multiple index models, it has recently been shown that the sliced inverse regression (SIR) is consistent for estimating the sufficient dimension reduction (SDR) space if and only if ρ=limpn=0, where p is the dimension and n is the sample size. Thus, when p is of the same or a higher order of n, additional assumptions such as sparsity must be imposed in order to ensure consistency for SIR. By constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix, we introduce a simple Lasso regression method to obtain an estimate of the SDR space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieve the optimal convergence rate under certain sparsity conditions when p is of order o(n²λ²), where λ is the generalized signal-to-noise ratio. We also demonstrate the superior performance of Lasso-SIR compared with existing approaches via extensive numerical studies and several real data examples.

Bootstrapping Max Statistics in High Dimensions: Near-Parametric Rates Under Weak Variance Decay and Application to Functional Data Analysis

Preprint

Jul 2018

In recent years, bootstrap methods have drawn attention for their ability to approximate the laws of "max statistics" in high-dimensional problems. A leading example of such a statistic is the coordinate-wise maximum of a sample average of $n$ random vectors in $\mathbb{R}^p$. Existing results for this statistic show that the bootstrap can work when $n\ll p$, and rates of approximation (in Kolmogorov distance) have been obtained with only logarithmic dependence in $p$. Nevertheless, one of the challenging aspects of this setting is that established rates tend to scale like $n^{-1/6}$ as a function of $n$. The main purpose of this paper is to demonstrate that improvement in rate is possible when extra model structure is available. Specifically, we show that if the coordinate-wise variances of the observations exhibit decay, then a nearly $n^{-1/2}$ rate can be achieved, independent of $p$. Furthermore, a surprising aspect of this dimension-free rate is that it holds even when the decay is very weak. As a numerical illustration, we show how these ideas can be used in the context of functional data analysis to construct simultaneous confidence intervals for the Fourier coefficients of a mean function.

Sparse Sliced Inverse Regression for High Dimensional Data

Article

Full-text available

Nov 2016

For multiple index models, it has recently been shown that the sliced inverse regression (SIR) is consistent for estimating the sufficient dimension reduction (SDR) subspace if and only if the dimension $p$ and sample size $n$ satisfies that $\rho=\lim\frac{p}{n}=0$. Thus, when $p$ is of the same or a higher order of $n$, additional assumptions such as sparsity have to be imposed in order to ensure consistency for SIR. By constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix, $\widehat{var}(\mathbb{E}[\boldsymbol{x}|y])$, we introduce a simple Lasso regression method to obtain an estimate of the SDR subspace. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieve the optimal convergence rate under certain sparsity conditions when $p$ is of order $o(n^2\lambda^2)$ where $\lambda$ is the generalized signal noise ratio. We also demonstrate the superior performance of Lasso-SIR compared with existing approaches via extensive numerical studies and several real data examples.

Variable selection via nonconcave penalized likelihood and its oracle properties

Article

Jan 2011
J Am Stat Assoc

Multiple Testing Procedures With Applications to Genomics

Book

Jan 2008

Multiple Hypothesis Testing.- Test Statistics Null Distribution.- Overview of Multiple Testing Procedures.- Single-Step Multiple Testing Procedures for Controlling General Type I Error Rates, ?(Fvn).- Step-Down Multiple Testing Procedures for Controlling the Family-Wise Error Rate.- Augmentation Multiple Testing Procedures for Controlling Generalized Tail Probability Error Rates.- Resampling-Based Empirical Bayes multiple Testing Procedures for Controlling Generalized Tail Probability Error Rates.- Simulation Studies: Assessment of Test Statistics Null Distributions.- Identification of Differentially Expressed and Co-Expressed Genes in High-Throughput Gene Expression Experiments.- Multiple Tests of Association with Biological Annotation Metadata.- HIV-1 Sequence Variation and Viral Replication Capacity.- Genetic Mapping of Complex Human Traits Using Single Nucleotide Polymorphisms: The ObeLinks Project.- Software Implementation.

Branch-and-bound applications in combinatorial data analysis

Article

Jan 2005

Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors

Article

Dec 2013
ANN STAT

We derive a Gaussian approximation result for the maximum of a sum of high-dimensional random vectors. Specifically, we establish conditions under which the distribution of the maximum is approximated by that of the maximum of a sum of the Gaussian random vectors with the same covariance matrices as the original vectors. This result applies when the dimension of random vectors (p) is large compared to the sample size (n); in fact, p can be much larger than n, without restricting correlations of the coordinates of these vectors. We also show that the distribution of the maximum of a sum of the random vectors with unknown covariance matrices can be consistently estimated by the distribution of the maximum of a sum of the conditional Gaussian random vectors obtained by multiplying the original vectors with i.i.d. Gaussian multipliers. This is the Gaussian multiplier (or wild) bootstrap procedure. Here too, p can be large or even much larger than n. These distributional approximations, either Gaussian or conditional Gaussian, yield a high-quality approximation to the distribution of the original maximum, often with approximation error decreasing polynomially in the sample size, and hence are of interest in many applications. We demonstrate how our Gaussian approximations and the multiplier bootstrap can be used for modern highdimensional estimation, multiple hypothesis testing, and adaptive specification testing. All these results contain nonasymptotic bounds on approximation errors.

Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction

Article

Jan 2010

Bradley Efron

We live in a new age for statistical inference, where modern scientific technology such as microarrays and fMRI machines routinely produce thousands and sometimes millions of parallel data sets, each with its own estimation or testing problem. Doing thousands of problems at once is more than repeated application of classical methods. Taking an empirical Bayes approach, Bradley Efron, inventor of the bootstrap, shows how information accrues across problems in a way that combines Bayesian and frequentist ideas. Estimation, testing, and prediction blend in this framework, producing opportunities for new methodologies of increased power. New difficulties also arise, easily leading to flawed inferences. This book takes a careful look at both the promise and pitfalls of large-scale statistical inference, with particular attention to false discovery rates, the most successful of the new statistical techniques. Emphasis is on the inferential ideas underlying technical developments, illustrated using a large number of real examples.

Two-sample test of high dimensional means under dependence

Article

Aug 2013
J Roy Stat Soc B Stat Meth

The paper considers in the high dimensional setting a canonical testing problem in multivariate analysis, namely testing the equality of two mean vectors. We introduce a new test statistic that is based on a linear transformation of the data by the precision matrix which incorporates the correlations between the variables. The limiting null distribution of the test statistic and the power of the test are analysed. It is shown that the test is particularly powerful against sparse alternatives and enjoys certain optimality. A simulation study is carried out to examine the numerical performance of the test and to compare it with other tests given in the literature. The results show that the test proposed significantly outperforms those tests in a range of settings.

Endogeneity in High Dimensions

Article

Jun 2014
ANN STAT

Yuan Liao

Most papers on high-dimensional statistics are based on the assumption that none of the regressors are correlated with the regression error, namely, they are exogenous. Yet, endogeneity can arise incidentally from a large pool of regressors in a high-dimensional regression. This causes the inconsistency of the penalized least-squares method and possible false scientific discoveries. A necessary condition for model selection consistency of a general class of penalized regression methods is given, which allows us to prove formally the inconsistency claim. To cope with the incidental endogeneity, we construct a novel penalized focused generalized method of moments (FGMM) criterion function. The FGMM effectively achieves the dimension reduction and applies the instrumental variable methods. We show that it possesses the oracle property even in the presence of endogenous predictors, and that the solution is also near global minimum under the over-identification assumption. Finally, we also show how the semi-parametric efficiency of estimation can be achieved via a two-step approach.

Simulation-based hypothesis testing of high dimensional means under covariance heterogeneity

Article

Full-text available

Dec 2017
Biometrics

Hypothesis testing for high-dimensional mean vectors has gained increasing attentions and stimulated innovative methodologies in statistics. In this paper, we introduce a fast-computational simulation-based testing procedure that allows for general covariance heterogeneity in the data for both one-sample and two-sample problems. The proposed procedure is based on the maximum-type statistics and the critical values are computed via the Gaussian approximation. Different from most existing tests that rely on certain structural restrictions on the covariance matrix, our method imposes no assumptions on the dependence structure of underlying distributions, which makes it can be easily applied in practice. Theoretical properties for the proposed testing procedure are systematically analyzed, including the rate of convergence, the asymptotic size and power. When testing against sparse alternatives, we develop a preliminary screening step to improve the power of the proposed tests and a data-driven procedure is also developed for practical implementations. Simulation studies demonstrate that, compared to existing methods, our testing procedure provides good performance against strong heterogeneity in the covariance structure of the data. The proposed procedure is also applied to the analysis of real datasets from leukemia studies in testing the significance of genes.

Necessary and sufficient conditions for the asymptotic distributions of coherence of ultra-high dimensional random matrices

Article

Full-text available

Feb 2014
ANN PROBAB

Let $\mathbf {x}_1,\ldots,\mathbf {x}_n$ be a random sample from a $p$-dimensional population distribution, where $p=p_n\to\infty$ and $\log p=o(n^{\beta})$ for some $0<\beta\leq1$, and let $L_n$ be the coherence of the sample correlation matrix. In this paper it is proved that $\sqrt{n/\log p}L_n\to2$ in probability if and only if $Ee^{t_0|x_{11}|^{\alpha}}<\infty$ for some $t_0>0$, where $\alpha$ satisfies $\beta=\alpha/(4-\alpha)$. Asymptotic distributions of $L_n$ are also proved under the same sufficient condition. Similar results remain valid for $m$-coherence when the variables of the population are $m$ dependent. The proofs are based on self-normalized moderate deviations, the Stein-Chen method and a newly developed randomized concentration inequality.

Local Properties of Distributions of Stochastic Functionals

Book

Jan 1998

Are Discoveries Spurious? Distributions of Maximum Spurious Correlations and Their Applications

Methods used:

Recommendations

Statistical Inference in High-Dimensions Robust to Sparsity

Nonparametric Regression

Matrix completion

Two-step variable selection in quantile regression models

Guarding against Spurious Discoveries in High Dimensions

Two-Sample Smooth Tests for the Equality of Distributions

Adaptive Robust Variable Selection

Nonconcave Penalized Likelihood With NP-Dimensionality

Penalized Composite Quasi-Likelihood for Ultrahigh-Dimensional Variable Selection

Are Discoveries Spurious? Distributions of Maximum Spurious Correlations and Their Applications

Methods used:

Discover the world's research