Submitted to the Annals of Statistics

ARE DISCOVERIES SPURIOUS? DISTRIBUTIONS OF MAXIMUM SPURIOUS CORRELATIONS AND THEIR APPLICATIONS*

By Jianqing Fan, Qi-Man Shao and Wen-Xin Zhou

Princeton University, Academy of Mathematics and Systems Science, Chinese University of Hong Kong, and University of California, San Diego

Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries from these data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions about the exogeneity of the covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable $Y$ with the best $s$ linear combinations of $p$ covariates $X$, even when $X$ and $Y$ are independent. When the covariance matrix of $X$ possesses the restricted eigenvalue property, we derive such distributions for both a finite $s$ and a diverging $s$, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of $X$. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where the residuals are from regularized fits. Our approach is then used to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of the covariates. The former provides a baseline for guarding against false discoveries and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated with both numerical examples and real data analysis.

*J. Fan was partially supported by NSF Grants DMS-1206464, DMS-1406266, NIH Grant R01-GM072611-10, and the National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 100190. Q.-M. Shao was supported in part by Hong Kong Research Grants Council GRF-403513 and GRF-14302515. W.-X. Zhou was supported by NIH Grant R01-GM072611-10.

Keywords and phrases: High dimension, spurious correlation, bootstrap, false discovery.

arXiv:1502.04237v3 [math.ST] 11 Apr 2017

1. Introduction. Information technology has forever changed the data collection process. Massive amounts of very high-dimensional or unstructured data are continuously produced and stored at an affordable cost. Massive and complex data and high dimensionality characterize contemporary statistical problems in many emerging fields of science and engineering. Various statistical and machine learning methods and algorithms have been proposed to find a small group of covariate variables that are associated with given responses, such as biological and clinical outcomes. These methods have been very successfully applied to genomics, genetics, neuroscience, economics, and finance. For an overview of high-dimensional statistical theory and methods, see the review article by Fan and Lv (2010) and the monographs by Dudoit and van der Laan (2007), Hastie, Tibshirani and Friedman (2009), Efron (2010) and Bühlmann and van de Geer (2011).

Underlying machine learning, data mining, and high-dimensional statistical techniques, there are many model assumptions and even heuristic arguments. For example, the LASSO [Tibshirani (1996)] and the SCAD [Fan and Li (2001)] are based on an exogeneity assumption, meaning that all of the covariates and the residual of the true model are uncorrelated. However, it is nearly impossible that such a random variable, which is the part of the response variable that cannot be explained by a small group of covariates and lives in a low-dimensional space spanned by the response and the small group of variables, is uncorrelated with any of the tens of thousands of covariates. Indeed, Fan and Liao (2014) and Fan, Han and Liu (2014) provide evidence that such an ideal assumption might not be valid, although it is a necessary condition for model selection consistency. Even under the exogeneity assumption, conditions such as the restricted eigenvalue condition [Bickel, Ritov and Tsybakov (2009)] and homogeneity [Fan, Han and Liu (2014)] are needed to ensure model selection consistency or oracle properties. Despite their critical importance, these conditions have rarely been verified in practice, and their violations can lead to false scientific discoveries. A simpler question is then: for a given data set, do data mining techniques produce results that are better than spurious correlation? The answer depends not only on the correlation between the fitted and observed values, but also on the sample size, the number of variables selected, and the total number of variables.

To better appreciate the above two questions, let us consider an example. We take the gene expression data on 90 Asians (45 Japanese and 45 Han Chinese) from the international 'HapMap' project [Thorisson et al. (2005)]. The normalized gene expression data are generated with an Illumina Sentrix Human-6 Expression Bead Chip [Stranger et al. (2007)] and are available on ftp://ftp.sanger.ac.uk/pub/genevar/. We take the expressions of gene CHRNA6, a cholinergic receptor (nicotinic, alpha 6), as the response


Fig 1: Histogram of the sample correlations between the residuals and each covariate (blue) and histogram of $N(0, 1/\sqrt n)$ random variables (green).

$Y$ and the remaining expressions of probes as covariates $X$ with dimension $p = 47292$. We first fit an $\ell_1$-penalized least-squares regression (LASSO) on the data, with a tuning parameter automatically selected via ten-fold cross-validation (25 genes are selected). The correlation between the LASSO-fitted value and the response is 0.8991. Next, we refit an ordinary least-squares regression on the selected model to calculate the fitted response and residual vector. The sample correlation between the post-LASSO fit and the observed responses is 0.9214, a remarkable fit! But is it any better than the spurious correlation? The model diagnostic plot, which depicts the empirical distribution of the correlations between each covariate $X_j$ and the residual $\widehat\varepsilon$ after the LASSO fit, is given in Figure 1. Does the exogeneity assumption that $E(\varepsilon X_j) = 0$ for all $j = 1, \ldots, p$ hold?

To answer the above two important questions, we need to derive the distributions of the maximum spurious correlations. Let $X$ be the $p$-dimensional random vector of the covariates and $X_S$ be a subset of covariates indexed by $S$. Let $\widehat{\mathrm{corr}}_n(\varepsilon, \alpha_S^T X_S)$ be the sample correlation between the random noise $\varepsilon$ (independent of $X$) and $\alpha_S^T X_S$ based on a sample of size $n$, where $\alpha_S$ is a constant vector. Then, the maximum spurious correlation is defined as

(1.1)  $\widehat R_n(s, p) = \max_{|S| = s} \max_{\alpha_S} \widehat{\mathrm{corr}}_n(\varepsilon, \alpha_S^T X_S)$,

when $X$ and $\varepsilon$ are independent, where the maximization is taken over all $\binom{p}{s}$ subsets of size $s$ and over all linear combinations of the selected $s$ covariates. Next, let $(Y_1, X_1), \ldots, (Y_n, X_n)$ be independent and identically distributed (i.i.d.) observations from the linear model $Y = X^T \beta^* + \varepsilon$. Assume that $s$ covariates are selected by a certain variable selection method for some $1 \le s \le \min(p, n)$. If the correlation between the fitted response and the observed response is no more than the 90th or the 95th percentile of $\widehat R_n(s, p)$, it is hard to claim that the fitted value is impressive or even genuine. In this case, the finding is, 90% or 95% of the time, hardly more impressive than the best fit obtained from data that consist of independent response and explanatory variables. To simplify and unify the terminology, we call such a result a spurious discovery throughout this paper.
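To make the definition concrete, here is a minimal Python sketch (not the authors' code) that computes $\widehat R_n(s, p)$ by brute force for small $p$ and $s$. For each size-$s$ subset $S$, the inner maximization over $\alpha_S$ is attained by the least-squares projection of the centered noise onto the centered columns of $X_S$, so the inner maximum is the sample multiple correlation coefficient. The function name and the exhaustive search are illustrative only; the paper's simulations use stepwise addition and branch-and-bound instead.

```python
import itertools
import numpy as np

def max_spurious_corr(eps, X, s):
    """Brute-force R_hat_n(s, p): maximum over all size-s subsets of the
    sample multiple correlation between eps and the selected covariates."""
    n, p = X.shape
    e = eps - eps.mean()                 # centered noise
    Xc = X - X.mean(axis=0)              # centered covariates
    best = 0.0
    for S in itertools.combinations(range(p), s):
        XS = Xc[:, S]
        # least-squares projection of e onto the span of the columns of XS
        fit = XS @ np.linalg.lstsq(XS, e, rcond=None)[0]
        denom = np.linalg.norm(e) * np.linalg.norm(fit)
        if denom > 0:
            best = max(best, abs(e @ fit) / denom)
    return best
```

For $s = 1$ this reduces to the maximum absolute marginal correlation, and the statistic is non-decreasing in $s$, since a larger subset spans a larger subspace.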

For the aforementioned gene expression data, as 25 probes are selected, the observed correlation of 0.9214 between the fitted value and the response should be compared with the distribution of $\widehat R_n(25, p)$. Further, a simple method to test the null hypothesis

(1.2)  $E(\varepsilon X_j) = 0$, for all $j = 1, \ldots, p$,

is to compare the maximum absolute correlation in Figure 1 with the distribution of $\widehat R_n(1, p)$. See additional details in Section 5.3.

The importance of such spurious correlation was recognized by Cai and Jiang (2011), Fan, Guo and Hao (2012) and Cai, Fan and Jiang (2013). When the data are independently and normally distributed, they derive the distribution of $\widehat R_n(1, p)$, which is equivalent to the distribution of the minimum angle to the north pole among $p$ random points uniformly distributed on the $(n+1)$-dimensional sphere. Fan, Guo and Hao (2012) conducted simulations to demonstrate that the spurious correlation can be very high when $p$ is large and grows quickly with $s$. To demonstrate this effect and to examine the impact of correlation and sample size, we conduct a similar but more extensive simulation study based on a combination of the stepwise addition and branch-and-bound algorithms. We simulate $X$ from $N(0, I_p)$ and $N(0, \Sigma_0)$, where $\Sigma_0$ is block diagonal, with the first block being a $500 \times 500$ equi-correlation matrix with correlation 0.8 and the second block being the $(p-500) \times (p-500)$ identity matrix. $Y$ is simulated independently of $X$ and follows the standard normal distribution. Figure 2 depicts the simulation results for $n = 50$, 100 and 200. Clearly, the distributions depend on


Fig 2: Distributions of maximum spurious correlations for $p = 1000$ and $s = 1, 2, 5$ and 10 when $\Sigma$ is the identity matrix (left panel) or block diagonal (right panel), with the first block being a $500 \times 500$ equi-correlation matrix with correlation 0.8 and the second block being the $500 \times 500$ identity matrix. From top to bottom: $n = 50$, 100 and 200.

$(s, p, n)$ and $\Sigma$, the covariance matrix of $X$, although the dependence on $\Sigma$ does not seem very strong. However, the theoretical result of Fan, Guo and Hao (2012) covers only the very specific case where $s = 1$ and $\Sigma = I_p$.

There are several challenges to deriving the asymptotic distribution of the statistic $\widehat R_n(s, p)$, as it involves combinatorial optimization. Further technical complications are added by the dependence among the covariates $X$. Nevertheless, under the restricted eigenvalue condition [Bickel, Ritov and Tsybakov (2009)] on $\Sigma$, in this paper we derive the asymptotic distribution of such a spurious correlation statistic for both a fixed $s$ and a diverging $s$, using the empirical process and Gaussian approximation techniques given in Chernozhukov, Chetverikov and Kato (2014a). As expected, such distributions depend on the unknown covariance matrix $\Sigma$. To provide a consistent estimate of the distributions of the spurious correlations, we consider the use of a multiplier bootstrap method and demonstrate its consistency under mild conditions. The multiplier bootstrap procedure has been widely used due to its good numerical performance. Its theoretical validity is guaranteed by the multiplier central limit theorem [van der Vaart and Wellner (1996)]. For the most advanced recent results, we refer to Chatterjee and Bose (2005), Arlot, Blanchard and Roquain (2010) and Chernozhukov, Chetverikov and Kato (2013). In particular, Chernozhukov, Chetverikov and Kato (2013) developed a number of non-asymptotic results on a multiplier bootstrap for the maxima of empirical mean vectors in high dimensions, with applications to multiple hypothesis testing and parameter choice for the Dantzig selector. The use of multiplier bootstrapping enables us to empirically compute the upper confidence limit of $\widehat R_n(s, p)$ and hence decide whether discoveries by statistical machine learning techniques are any better than spurious correlations.

The rest of this paper is organized as follows. Section 2 discusses the concept of spurious correlation and introduces the main conditions and notation. Section 3 presents the main results on the asymptotic distributions of spurious correlations and their bootstrap approximations, which are further extended in Section 4. Section 5 identifies three important applications of our results to high-dimensional statistical inference. Section 6 presents the numerical studies. The proof of Theorem 3.1 is provided in Section 7, and the proofs of the remaining theoretical results are provided in the supplementary material [Fan, Shao and Zhou (2017)].

2. Spurious correlation, conditions, and notation. Let $\varepsilon, \varepsilon_1, \ldots, \varepsilon_n$ be i.i.d. random variables with a mean of zero and a finite variance $\sigma^2 > 0$, and let $X, X_1, \ldots, X_n$ be i.i.d. $p$-dimensional random vectors with a mean of zero and a covariance matrix $\Sigma = E(XX^T) = (\sigma_{jk})_{1 \le j,k \le p}$. Write

$X = (X_1, \ldots, X_p)^T$, $\quad X_i = (X_{i1}, \ldots, X_{ip})^T$, $\quad i = 1, \ldots, n$.

Assume that the two samples $\{\varepsilon_i\}_{i=1}^n$ and $\{X_i\}_{i=1}^n$ are independent. Then, the spurious correlation (1.1) can be written as

(2.1)  $\widehat R_n(s, p) = \max_{\alpha \in \mathbb S^{p-1} : |\alpha|_0 = s} \widehat{\mathrm{corr}}_n(\varepsilon, \alpha^T X)$,

where the dimension $p$ and the sparsity $s$ are allowed to grow with the sample size $n$. Here, $\widehat{\mathrm{corr}}_n(\cdot, \cdot)$ denotes the sample Pearson correlation coefficient and $\mathbb S^{p-1} := \{\alpha \in \mathbb R^p : |\alpha|_2 = 1\}$ is the unit sphere of $\mathbb R^p$. Due to the anti-symmetry of the sample correlation under the sign transformation of $\alpha$, we also have

(2.2)  $\widehat R_n(s, p) = \max_{\alpha \in \mathbb S^{p-1} : |\alpha|_0 = s} |\widehat{\mathrm{corr}}_n(\varepsilon, \alpha^T X)|$.


More specifically, we can express $\widehat R_n(s, p)$ as

(2.3)  $\displaystyle \max_{S \subseteq [p] : |S| = s} \; \max_{\alpha \in \mathbb S^{s-1}} \frac{\sum_{i=1}^n (\varepsilon_i - \bar\varepsilon_n) \langle \alpha, X_{i,S} - \bar X_{n,S} \rangle}{\sqrt{\sum_{i=1}^n (\varepsilon_i - \bar\varepsilon_n)^2 \cdot \sum_{i=1}^n \langle \alpha, X_{i,S} - \bar X_{n,S} \rangle^2}}$.

By the scale-invariance property of $\widehat R_n(s, p)$, we assume without loss of generality that $\sigma^2 = 1$ and that $\Sigma$ is a correlation matrix, so that $\mathrm{diag}(\Sigma) = I_p$.

For a random variable $X$, the sub-Gaussian norm $\|X\|_{\psi_2}$ and the sub-exponential norm $\|X\|_{\psi_1}$ of $X$ are defined, respectively, as

$\|X\|_{\psi_2} = \sup_{q \ge 1} q^{-1/2} (E|X|^q)^{1/q}$  and  $\|X\|_{\psi_1} = \sup_{q \ge 1} q^{-1} (E|X|^q)^{1/q}$.

A random variable $X$ that satisfies $\|X\|_{\psi_2} < \infty$ (resp., $\|X\|_{\psi_1} < \infty$) is called a sub-Gaussian (resp., sub-exponential) random variable [Vershynin (2012)]. The following moment conditions for $\varepsilon \in \mathbb R$ and $X \in \mathbb R^p$ are imposed.

Condition 2.1. There exists a random vector $U$ such that $X = \Sigma^{1/2} U$, $E(U) = 0$, $E(UU^T) = I_p$ and $K_1 := \sup_{\alpha \in \mathbb S^{p-1}} \|\alpha^T U\|_{\psi_2} < \infty$. The random variable $\varepsilon$ has a zero mean and unit variance, and is sub-Gaussian with $K_0 := \|\varepsilon\|_{\psi_2} < \infty$. Moreover, write $v_q = E(|\varepsilon|^q)$ for $q \ge 3$.

The following is our assumption on the sampling process.

Condition 2.2. $\{\varepsilon_i\}_{i=1}^n$ and $\{X_i\}_{i=1}^n$ are independent random samples from the distributions of $\varepsilon$ and $X$, respectively.

For $1 \le s \le p$, the $s$-sparse minimal and maximal eigenvalues [Bickel, Ritov and Tsybakov (2009)] of the covariance matrix $\Sigma$ are defined as

$\phi_{\min}(s) = \min_{u \in \mathbb R^p : 1 \le |u|_0 \le s} |u|_\Sigma^2 / |u|_2^2$, $\quad \phi_{\max}(s) = \max_{u \in \mathbb R^p : 1 \le |u|_0 \le s} |u|_\Sigma^2 / |u|_2^2$,

where $|u|_\Sigma = (u^T \Sigma u)^{1/2}$ and $|u|_2 = (u^T u)^{1/2}$ is the $\ell_2$-norm of $u$. Consequently, for $1 \le s \le p$, the $s$-sparse condition number of $\Sigma$ is given by

(2.4)  $\gamma_s = \gamma_s(\Sigma) = \sqrt{\phi_{\max}(s) / \phi_{\min}(s)}$.

The quantity $\gamma_s$ plays an important role in our analysis.
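For small $p$, these quantities can be computed exactly: a vector with $1 \le |u|_0 \le s$ ranges over supports $S$ with $|S| \le s$, and the extremes of $u^T \Sigma u / |u|_2^2$ on a fixed support are the extreme eigenvalues of the corresponding principal submatrix. The following brute-force sketch (illustrative only; infeasible for large $p$) makes the definitions concrete.

```python
import itertools
import numpy as np

def sparse_condition_number(Sigma, s):
    """Exact phi_min(s), phi_max(s) and gamma_s of (2.4) for a small
    covariance matrix, via the extreme eigenvalues of all principal
    submatrices of size at most s."""
    p = Sigma.shape[0]
    phi_min, phi_max = np.inf, 0.0
    # By Cauchy interlacing, size-s supports already suffice, but looping
    # over all sizes k <= s mirrors the definition literally.
    for k in range(1, s + 1):
        for S in itertools.combinations(range(p), k):
            eigs = np.linalg.eigvalsh(Sigma[np.ix_(S, S)])
            phi_min = min(phi_min, eigs[0])
            phi_max = max(phi_max, eigs[-1])
    return phi_min, phi_max, np.sqrt(phi_max / phi_min)
```

For example, $\gamma_s(I_p) = 1$ for every $s$, while strong correlations among small groups of covariates inflate $\gamma_s$.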

The following notation is used. For two sequences $\{a_n\}$ and $\{b_n\}$ of positive numbers, we write $a_n = O(b_n)$ or $a_n \lesssim b_n$ if there exists a constant $C > 0$ such that $a_n / b_n \le C$ for all sufficiently large $n$; we write $a_n \asymp b_n$ if there exist constants $C_1, C_2 > 0$ such that, for all $n$ large enough, $C_1 \le a_n / b_n \le C_2$; and we write $a_n \sim b_n$ and $a_n = o(b_n)$ if $\lim_{n \to \infty} a_n / b_n = 1$ and $\lim_{n \to \infty} a_n / b_n = 0$, respectively. For $a, b \in \mathbb R$, we write $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. For every vector $u$, we write $|u|_q = (\sum_{i \ge 1} |u_i|^q)^{1/q}$ for $q > 0$ and $|u|_0 = \sum_{i \ge 1} I\{u_i \ne 0\}$. We use $\langle u, v \rangle = u^T v$ to denote the inner product of two vectors $u$ and $v$ of the same dimension, and $\|M\|$ to denote the spectral norm of a matrix $M$. For every positive integer $\ell$, we write $[\ell] = \{1, 2, \ldots, \ell\}$, and for any set $S$, we use $S^c$ to denote its complement and $|S|$ its cardinality. For each $p$-dimensional vector $u$ and each $p \times p$ positive semi-definite matrix $A$, we write $|u|_A = (u^T A u)^{1/2}$. In particular, put

(2.5)  $\alpha_\Sigma = \alpha / |\alpha|_\Sigma$

for every $\alpha \in \mathbb R^p$, with the convention $0_\Sigma = 0$.

3. Distributions of maximum spurious correlations. In this section, we first derive the asymptotic distributions of the maximum spurious correlation $\widehat R_n(s, p)$. The analytic form of such asymptotic distributions can be obtained in the isotropic case. As the asymptotic distributions of $\widehat R_n(s, p)$ depend on the unknown covariance matrix $\Sigma$, we provide a bootstrap estimate and demonstrate its consistency.

3.1. Asymptotic distributions of maximum spurious correlations. In view of (2.3), we can rewrite $\widehat R_n(s, p)$ as

(3.1)  $\displaystyle \widehat R_n(s, p) = \sup_{f \in \mathcal F} \frac{n^{-1} \sum_{i=1}^n (\varepsilon_i - \bar\varepsilon_n) f(X_i - \bar X_n)}{\sqrt{n^{-1} \sum_{i=1}^n (\varepsilon_i - \bar\varepsilon_n)^2} \cdot \sqrt{n^{-1} \sum_{i=1}^n f^2(X_i - \bar X_n)}}$,

where $\bar\varepsilon_n = n^{-1} \sum_{i=1}^n \varepsilon_i$, $\bar X_n = n^{-1} \sum_{i=1}^n X_i$ and

(3.2)  $\mathcal F = \mathcal F(s, p) = \{x \mapsto f_\alpha(x) := \langle \alpha, x \rangle : \alpha \in \mathcal V\}$

is a class of linear functions from $\mathbb R^p$ to $\mathbb R$, where $\mathcal V = \mathcal V(s, p) = \{\alpha \in \mathbb S^{p-1} : |\alpha|_0 = s\}$. The dependence of $\mathcal F$ and $\mathcal V$ on $(s, p)$ is suppressed.

Let $Z = (Z_1, \ldots, Z_p)^T$ be a $p$-dimensional Gaussian random vector with a mean of zero and covariance matrix $\Sigma$, i.e., $Z \stackrel{d}{=} N(0, \Sigma)$. Denote by $Z^2_{(1)} \le Z^2_{(2)} \le \cdots \le Z^2_{(p)}$ the order statistics of $\{Z^2_1, \ldots, Z^2_p\}$. The following theorem shows that the distribution of the maximum absolute multiple correlation $\widehat R_n(s, p)$ can be approximated by that of the supremum of a centered Gaussian process $G^*$ indexed by $\mathcal F$.


Theorem 3.1. Let Conditions 2.1 and 2.2 hold, $n, p \ge 2$ and $1 \le s \le p$. Then there exists a constant $C > 0$, independent of $(s, p, n)$, such that

(3.3)  $\displaystyle \sup_{t \ge 0} \big| P\{\sqrt n\, \widehat R_n(s, p) \le t\} - P\{R^*(s, p) \le t\} \big| \le C (K_0 K_1)^{3/4} n^{-1/8} \{s\, b_n(s, p)\}^{7/8}$,

where $K_0$ and $K_1$ are defined in Condition 2.1, $b_n(s, p) := \log(\gamma_s p / s) \vee \log n$ for $\gamma_s$ as in (2.4), $R^*(s, p) := \sup_{f \in \mathcal F} G^* f$, and $G^* = \{G^* f\}_{f \in \mathcal F}$ is a centered Gaussian process indexed by $\mathcal F$, defined for every $f_\alpha \in \mathcal F$ by

(3.4)  $G^* f_\alpha = \alpha_\Sigma^T Z = \dfrac{\alpha^T Z}{\sqrt{\alpha^T \Sigma \alpha}}$.

In particular, if $\Sigma = I_p$ and $s \log(pn) = o(n^{1/7})$, then, as $n \to \infty$,

(3.5)  $\displaystyle \sup_{t \ge 0} \big| P\{n \widehat R^2_n(s, p) \le t\} - P\{Z^2_{(p)} + \cdots + Z^2_{(p-s+1)} \le t\} \big| \to 0$.

Remark 3.1. The Berry-Esseen bound given in Theorem 3.1 depends explicitly on the triplet $(s, p, n)$, and it depends on the covariance matrix $\Sigma$ only through its $s$-sparse condition number $\gamma_s$, defined in (2.4). The proof of (3.3) builds on a number of technical tools, including a standard covering argument, maximal and concentration inequalities for the suprema of unbounded empirical processes and Gaussian processes, as well as a coupling inequality for the maxima of sums of random vectors derived in Chernozhukov, Chetverikov and Kato (2014a). If, instead, we directly resort to the general framework of Theorem 2.1 in Chernozhukov, Chetverikov and Kato (2014a), the function class of interest is $\mathcal F = \{x \mapsto \alpha^T x / (\alpha^T \Sigma \alpha)^{1/2} : \alpha \in \mathbb S^{p-1}, |\alpha|_0 = s\}$. Checking the high-level conditions of that theorem can be rather complicated and less intuitive. Also, dealing with the (uniform) entropy integral corresponding to the class $\mathcal F$ relies on verifying various VC-type properties, and thus can be fairly tedious. Following a strategy similar to that used to prove Theorem 2.1, we provide a self-contained proof of Theorem 3.1 in Section 7.2 by making the best use of the specific structure of $\mathcal F$. The proof is more intuitive and straightforward. More importantly, it leads to an explicit non-asymptotic bound under transparent conditions.

Remark 3.2. In Theorem 3.1, the independence assumption on $\varepsilon$ and $X$ can be relaxed to $E(\varepsilon X) = 0$, $E(\varepsilon^2 \mid X) = \sigma^2$ and $E(\varepsilon^4 \mid X) \le C$ almost surely, where $C > 0$ is a constant.


Expression (3.5) indicates that the increment $n\{\widehat R^2_n(s, p) - \widehat R^2_n(s-1, p)\}$ is approximately the same as $Z^2_{(p-s+1)}$. This can be seen directly from the asymptotic joint distribution of $\widehat R_n(1, p), \widehat R_n(2, p), \ldots, \widehat R_n(s, p)$. The following proposition establishes the approximation of the joint distributions when both the dimension $p$ and the sparsity $s$ are allowed to diverge with the sample size $n$.

Proposition 3.1. Let Conditions 2.1 and 2.2 hold with $\Sigma = I_p$. Assume that the triplet $(s, p, n)$ satisfies $1 \le s < n \le p$ and $s^2 \log p = o(n^{1/7})$. Then, as $n \to \infty$,

$\displaystyle \sup_{0 \equiv t_0 < t_1 < t_2 < \cdots < t_s < 1} \bigg| P\bigg\{\bigcap_{k=1}^s \big\{\widehat R_n(k, p) \le t_k\big\}\bigg\} - P\bigg\{\bigcap_{k=1}^s \big\{Z^2_{(p-k+1)} \le n(t^2_k - t^2_{k-1})\big\}\bigg\} \bigg| \to 0$.

Remark 3.3. When $s = 1$ and $(n, p)$ satisfies $\log p = o(n^{1/7})$, it is straightforward to verify that, for any $t \in \mathbb R$,

(3.6)  $P\{Z^2_{(p)} - 2\log p + \log(\log p) \le t\} \to \exp(-\pi^{-1/2} e^{-t/2})$  as $p \to \infty$.

This result is similar in nature to (5) in Fan, Guo and Hao (2012). In fact, it is proved in Shao and Zhou (2014) that the extreme-value statistic $\widehat R_n(1, p)$ is sensitive to heavy-tailed data, in the sense that, under the ultra-high dimensional scheme, even the law of large numbers for the maximum spurious correlation requires exponentially light tails of the underlying distribution. We refer readers to Theorem 2.1 in Shao and Zhou (2014) for details. Therefore, we believe that the exponential-type moment assumptions required in Theorem 3.1 cannot be weakened to polynomial-type ones as long as $\log p$ is allowed to be as large as $n^c$ for some $c \in (0, 1)$. However, it is worth mentioning that the factor $1/7$ in Proposition 3.1 may not be optimal; according to the results in Shao and Zhou (2014), $1/3$ is the best possible factor to ensure that the asymptotic theory is valid. Closing this gap would require a significant amount of additional work and new probabilistic techniques, and we do not pursue this line of research in this paper.
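The limit in (3.6) is easy to check numerically: for $Z \sim N(0, I_p)$, simulate $Z^2_{(p)} = \max_j Z^2_j$, center it by $a_p = 2\log p - \log(\log p)$, and compare the empirical CDF with $G(t) = \exp(-\pi^{-1/2} e^{-t/2})$. A small Monte Carlo sketch (the sample sizes, batching and seed are arbitrary choices of ours):

```python
import numpy as np

def limit_cdf(t):
    """G(t) = exp(-pi^{-1/2} e^{-t/2}), the limit distribution in (3.6)."""
    return np.exp(-np.exp(-t / 2.0) / np.sqrt(np.pi))

def empirical_cdf_max_chisq(p, t, n_mc=20000, batch=1000, seed=1):
    """Monte Carlo estimate of P(Z^2_(p) - a_p <= t) for Z ~ N(0, I_p),
    generated in batches to keep memory modest."""
    rng = np.random.default_rng(seed)
    a_p = 2.0 * np.log(p) - np.log(np.log(p))
    hits = 0
    for _ in range(n_mc // batch):
        max_sq = (rng.standard_normal((batch, p)) ** 2).max(axis=1)
        hits += int(np.sum(max_sq - a_p <= t))
    return hits / n_mc
```

Even for moderate $p$ (a few hundred), the empirical CDF is already within a few percentage points of $G$, consistent with the slow logarithmic rate typical of extreme-value limits.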

For a general $s \ge 2$, we establish in the following proposition the limiting distribution of the sum of the top $s$ order statistics of i.i.d. chi-square random variables with one degree of freedom.


Proposition 3.2. Assume that $s \ge 2$ is a fixed integer. For any $t \in \mathbb R$, we have, as $p \to \infty$,

(3.7)  $\displaystyle P\{Z^2_{(p)} + \cdots + Z^2_{(p-s+1)} - s a_p \le t\} \longrightarrow \frac{\pi^{(1-s)/2}}{(s-1)!\, \Gamma(s-1)} \int_{-\infty}^{t/s} \int_0^{(t-sv)/2} u^{s-2} e^{-u}\, du\; e^{-(s-1)v/2} g(v)\, dv$,

where $a_p = 2\log p - \log(\log p)$, $G(t) = \exp(-\pi^{-1/2} e^{-t/2})$ and $g(t) = G'(t) = \frac{e^{-t/2}}{2\sqrt\pi} G(t)$. The above integral can further be expressed as

(3.8)  $\displaystyle G(t/s) + \frac{\pi^{1-s/2} e^{-t/2}}{(s-1)!} \int_{-\infty}^{t/s} e^{u} g(u)\, du + \frac{\pi^{(1-s)/2} e^{-t/2}}{(s-1)!} \times \sum_{j=1}^{s-2} G(t/s)\, e^{(j+1)t/(2s)}\, \pi^{j/2} \prod_{\ell=1}^{j} (s-\ell)^{-1} \frac{1}{j!\, 2^j} \int_{-\infty}^{t/s} (t - sv)^j e^{v/2} g(v)\, dv$.

In particular, when $s = 2$, the last term on the right-hand side of (3.8) vanishes, so that, as $p \to \infty$,

$\displaystyle P\{Z^2_{(p)} + Z^2_{(p-1)} - 2a_p \le t\} \to G(t/2) + \frac{e^{-t/2}}{2\sqrt\pi} \int_{-\infty}^{t/2} e^{u/2} G(u)\, du$.

The proofs of Propositions 3.1 and 3.2 are provided in the supplementary material [Fan, Shao and Zhou (2017)].

3.2. Multiplier bootstrap approximation. The distribution of $R^*(s, p) = \sup_{f \in \mathcal F} G^* f$, for $G^*$ in (3.4), depends on the unknown $\Sigma$ and thus cannot be used directly for statistical inference. In the following, we consider the use of a Monte Carlo method to simulate a process that mimics $G^*$, now known as the multiplier (wild) bootstrap method, similar to that used in Hansen (1996), Barrett and Donald (2003) and Chernozhukov, Chetverikov and Kato (2013), among others.

Let $\widehat\Sigma_n$ be the sample covariance matrix based on the data $\{X_i\}_{i=1}^n$, and let $\xi_1, \ldots, \xi_n$ be i.i.d. standard normal random variables that are independent of $\{\varepsilon_i\}_{i=1}^n$ and $\{X_i\}_{i=1}^n$. Then, given $\{X_i\}_{i=1}^n$,

(3.9)  $Z_n = n^{-1/2} \sum_{i=1}^n \xi_i (X_i - \bar X_n) \sim N(0, \widehat\Sigma_n)$.


The following result shows that the (unknown) distribution of $R^*(s, p) = \sup_{f_\alpha \in \mathcal F} f_\alpha(Z) / \sqrt{\alpha^T \Sigma \alpha}$ for $Z \stackrel{d}{=} N(0, \Sigma)$ can be consistently estimated by the conditional distribution of

(3.10)  $\displaystyle R^{\mathrm{MB}}_n(s, p) := \sup_{f_\alpha \in \mathcal F} \frac{f_\alpha(Z_n)}{\sqrt{\alpha^T \widehat\Sigma_n \alpha}}$.

Theorem 3.2. Let Conditions 2.1 and 2.2 hold. Assume that the triplet $(s, p, n)$ satisfies $1 \le s \le p$ and $s \log(\gamma_s pn) = o(n^{1/5})$. Then, as $n \to \infty$,

(3.11)  $\displaystyle \sup_{t \ge 0} \big| P\{R^*(s, p) \le t\} - P\{R^{\mathrm{MB}}_n(s, p) \le t \mid X_1, \ldots, X_n\} \big| \stackrel{P}{\longrightarrow} 0$.

Remark 3.4. Together, Theorems 3.1 and 3.2 show that the maximum spurious correlation $\widehat R_n(s, p)$ can be approximated in distribution by the multiplier bootstrap statistic $n^{-1/2} R^{\mathrm{MB}}_n(s, p)$. In practice, when the sample size $n$ is relatively small, the value of $n^{-1/2} R^{\mathrm{MB}}_n(s, p)$ may exceed 1, which makes it less favorable as a proxy for spurious correlation. To address this issue, we propose using the following corrected bootstrap approximation:

(3.12)  $\displaystyle R^{\mathrm{CMB}}_n(s, p) := \sup_{f_\alpha \in \mathcal F} \frac{\sqrt n}{|\xi|_2} \frac{f_\alpha(Z_n)}{\sqrt{\alpha^T \widehat\Sigma_n \alpha}}$,

where $\xi = (\xi_1, \ldots, \xi_n)^T$ is the multiplier vector used in the definition of $Z_n$. By the Cauchy-Schwarz inequality, $n^{-1/2} R^{\mathrm{CMB}}_n(s, p)$ always lies between 0 and 1. In view of (3.10) and (3.12), $R^{\mathrm{CMB}}_n(s, p)$ differs from $R^{\mathrm{MB}}_n(s, p)$ only by the multiplicative random factor $n^{-1/2} |\xi|_2$, which in theory is concentrated around 1 with exponentially high probability. Thus, $R^{\mathrm{MB}}_n$ and $R^{\mathrm{CMB}}_n$ are asymptotically equivalent, and (3.11) remains valid with $R^{\mathrm{MB}}_n$ replaced by $R^{\mathrm{CMB}}_n$.

4. Extension to sparse linear models. Suppose that the observed response $Y$ and the $p$-dimensional covariate $X$ follow the sparse linear model

(4.1)  $Y = X^T \beta^* + \varepsilon$,

where the regression coefficient $\beta^*$ is sparse. The sparsity is typically explored by the LASSO [Tibshirani (1996)], the SCAD [Fan and Li (2001)], or the MCP [Zhang (2010)]. It is now well known that, under suitable conditions, the SCAD and the MCP, among other folded concave penalized least-squares estimators, also enjoy the unbiasedness property and the (strong) oracle properties. For simplicity, we focus on the SCAD. For a given random sample $\{(X_i, Y_i)\}_{i=1}^n$, the SCAD exploits the sparsity by $p_\lambda$-regularization, which minimizes

(4.2)  $\displaystyle (2n)^{-1} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 + \sum_{j=1}^p p_\lambda(|\beta_j|; a)$

over $\beta = (\beta_1, \ldots, \beta_p)^T \in \mathbb R^p$, where $p_\lambda(\cdot\,; a)$ denotes the SCAD penalty function [Fan and Li (2001)], i.e., $p'_\lambda(t; a) = \lambda I(t \le \lambda) + \frac{(a\lambda - t)_+}{a - 1} I(t > \lambda)$ for some $a > 2$, and $\lambda = \lambda_n \ge 0$ is a regularization parameter.
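The SCAD derivative $p'_\lambda(t; a)$ above is simple to code; a direct transcription (using the conventional default $a = 3.7$ suggested by Fan and Li (2001)):

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """SCAD penalty derivative p'_lambda(t; a) for t >= 0:
    lam on [0, lam], then (a*lam - t)_+ / (a - 1) beyond lam."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))
```

The derivative vanishes for $t \ge a\lambda$, so large coefficients are left unpenalized; this is the source of the unbiasedness property mentioned above.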

Denote by $X = (X_1, \ldots, X_n)^T$ the $n \times p$ design matrix, $Y = (Y_1, \ldots, Y_n)^T$ the $n$-dimensional response vector, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ the $n$-dimensional noise vector. Without loss of generality, we assume that $\beta^* = (\beta_1^T, \beta_2^T)^T$, with each component of $\beta_1 \in \mathbb R^s$ being non-zero and $\beta_2 = 0$, so that $S_0 := \mathrm{supp}(\beta^*) = \{1, \ldots, s\}$ is the true underlying sparse model, with $s = |\beta^*|_0$. Moreover, write $X = (X_1, X_2)$, where $X_1 \in \mathbb R^{n \times s}$ consists of the columns of $X$ indexed by $S_0$. In this notation, $Y = X\beta^* + \varepsilon = X_1 \beta_1 + \varepsilon$, and the oracle estimator $\widehat\beta^{\mathrm{oracle}}$ has the explicit form

(4.3)  $\widehat\beta^{\mathrm{oracle}}_1 = (X_1^T X_1)^{-1} X_1^T Y = \beta_1 + (X_1^T X_1)^{-1} X_1^T \varepsilon$, $\quad \widehat\beta^{\mathrm{oracle}}_2 = 0$.

In other words, the oracle estimator is the unpenalized estimator that minimizes $\sum_{i=1}^n (Y_i - X_{i,S_0}^T \beta_{S_0})^2$ over the true support set $S_0$.

Denote by $\widehat\varepsilon^{\mathrm{oracle}} = (\widehat\varepsilon^{\mathrm{oracle}}_1, \ldots, \widehat\varepsilon^{\mathrm{oracle}}_n)^T = Y - X \widehat\beta^{\mathrm{oracle}}$ the residuals after the oracle fit. Then, we can construct the maximum spurious correlation as in (2.2), except that $\{\varepsilon_i\}_{i=1}^n$ is now replaced by $\{\widehat\varepsilon^{\mathrm{oracle}}_i\}_{i=1}^n$, i.e.,

(4.4)  $\displaystyle \widehat R^{\mathrm{oracle}}_n(1, p) = \max_{j \in [p]} \frac{\big| \sum_{i=1}^n (\widehat\varepsilon^{\mathrm{oracle}}_i - e_n^T \widehat\varepsilon^{\mathrm{oracle}})(X_{ij} - \bar X_j) \big|}{\sqrt{\sum_{i=1}^n (\widehat\varepsilon^{\mathrm{oracle}}_i - e_n^T \widehat\varepsilon^{\mathrm{oracle}})^2} \cdot \sqrt{\sum_{i=1}^n (X_{ij} - \bar X_j)^2}}$,

where $e_n = (1/n, \ldots, 1/n)^T \in \mathbb R^n$ and $\bar X_j = n^{-1} \sum_{i=1}^n X_{ij}$. We deal here with the specific case of a spurious correlation of size 1, as this is what is needed for testing the exogeneity assumption (1.2).

To establish the limiting distribution of $\widehat R^{\mathrm{oracle}}_n(1, p)$, we make the following assumptions.

Condition 4.1. $Y = X\beta^* + \varepsilon$ with $\mathrm{supp}(\beta^*) = \{1, \ldots, s\}$ and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ consisting of i.i.d. centered sub-Gaussian variables satisfying $K_0 = \|\varepsilon_i\|_{\psi_2} < \infty$. The rows of $X = (X_1, \ldots, X_n)^T$ are i.i.d. sub-Gaussian random vectors as in Condition 2.1.

As before, we can assume that $\Sigma = E(X_i X_i^T)$ is a correlation matrix with $\mathrm{diag}(\Sigma) = I_p$. Set $d = p - s$ and partition

(4.5)  $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$ with $\Sigma_{11} \in \mathbb R^{s \times s}$, $\Sigma_{22} \in \mathbb R^{d \times d}$, $\Sigma_{21} = \Sigma_{12}^T$.

Let $\Sigma_{22.1} = (\widetilde\sigma_{jk})_{1 \le j,k \le d} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$ be the Schur complement of $\Sigma_{11}$ in $\Sigma$.

Condition 4.2. $\widetilde\sigma_{\min} = \min_{1 \le j \le d} \widetilde\sigma_{jj}$ is bounded away from zero.

Theorem 4.1. Assume that Conditions 4.1 and 4.2 hold, and that the triplet $(s, p, n)$ satisfies $s \log p = o(\sqrt n)$ and $\log p = o(n^{1/7})$. Then the maximum spurious correlation $\widehat R^{\mathrm{oracle}}_n(1, p)$ in (4.4) satisfies, as $n \to \infty$,

(4.6)  $\displaystyle \sup_{t \ge 0} \big| P\{\sqrt n\, \widehat R^{\mathrm{oracle}}_n(1, p) \le t\} - P\{|\widetilde Z|_\infty \le t\} \big| \to 0$,

where $\widetilde Z \stackrel{d}{=} N(0, \Sigma_{22.1})$ is a $d$-variate centered Gaussian random vector with covariance matrix $\Sigma_{22.1}$.
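The statistic $\widehat R^{\mathrm{oracle}}_n(1, p)$ in (4.4) is easy to compute once a support is given: refit OLS on it and take the largest absolute sample correlation between the resulting residuals and the $p$ covariates. A minimal sketch (illustrative; here the support is user-supplied, whereas in practice it comes from a regularized fit, as discussed below):

```python
import numpy as np

def oracle_spurious_corr(Y, X, support):
    """R_hat^oracle_n(1, p) of (4.4): OLS refit on the given support,
    then the maximum absolute sample correlation between the residuals
    and each covariate."""
    X1 = X[:, list(support)]
    beta1, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    resid = Y - X1 @ beta1
    rc = resid - resid.mean()
    Xc = X - X.mean(axis=0)
    corr = np.abs(rc @ Xc) / (np.linalg.norm(rc) * np.linalg.norm(Xc, axis=0))
    return float(np.max(corr))
```

Under the exogeneity null (1.2), this statistic behaves like the maximum of $d$ correlated Gaussian coordinates, as Theorem 4.1 makes precise.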

As $p_\lambda$ is a folded-concave penalty function, (4.2) is a non-convex optimization problem. The local linear approximation (LLA) algorithm can be applied to produce a certain local minimum for any fixed initial solution [Zou and Li (2008), Fan, Xue and Zou (2014)]. In particular, Fan, Xue and Zou (2014) prove that the LLA algorithm delivers the oracle estimator in the folded concave penalized problem with overwhelming probability if it is initialized by an appropriate initial estimator.

Let $\widehat\beta^{\mathrm{LLA}}$ be the estimator computed via the one-step LLA algorithm initialized by the LASSO estimator [Tibshirani (1996)]. That is,

(4.7)  $\displaystyle \widehat\beta^{\mathrm{LLA}} = \arg\min_\beta \bigg\{ (2n)^{-1} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 + \sum_{j=1}^p p'_\lambda(|\widehat\beta^{\mathrm{LASSO}}_j|) |\beta_j| \bigg\}$,

where $p_\lambda$ is a folded concave penalty, such as the SCAD or MCP penalty, and $\widehat\beta^{\mathrm{LASSO}} = \arg\min_\beta \{(2n)^{-1} \sum_{i=1}^n (Y_i - X_i^T \beta)^2 + \lambda |\beta|_1\}$. Accordingly, denote by $\widehat R^{\mathrm{LLA}}_n(1, p)$ the maximum spurious correlation defined as in (4.4), with $\widehat\varepsilon^{\mathrm{oracle}}_i$ replaced by $\widehat\varepsilon^{\mathrm{LLA}}_i = Y_i - X_i^T \widehat\beta^{\mathrm{LLA}}$. Applying Theorem 4.1, we derive the limiting distribution of $\widehat R^{\mathrm{LLA}}_n(1, p)$ under suitable conditions. First, let us recall the restricted eigenvalue (RE) concept formulated by Bickel, Ritov and Tsybakov (2009).


Definition 4.1. For any integer $s_0 \in [p]$ and positive number $c_0$, the $\mathrm{RE}(s_0, c_0)$ parameter $\kappa(s_0, c_0, A)$ of a $p \times p$ matrix $A$ is defined as

(4.8)  $\displaystyle \kappa(s_0, c_0, A) := \min_{S \subseteq [p] : |S| \le s_0} \; \min_{\delta \ne 0 : |\delta_{S^c}|_1 \le c_0 |\delta_S|_1} \frac{\delta^T A \delta}{|\delta_S|_2^2}$.

Theorem 4.2. Assume that Conditions 4.1 and 4.2 hold, that the minimal signal strength of $\beta^*$ satisfies $\min_{j \in S_0} |\beta_j| > (a + 1)\lambda$ for $a, \lambda$ as in (4.2), and that the triplet $(s, p, n)$ satisfies $s \log p = o(\sqrt n)$, $\frac{s \log p}{\kappa(s, 3 + \epsilon, \Sigma)} = o(n)$ for some $\epsilon > 0$, and $\log p = o(n^{1/7})$. If the regularization parameters $(\lambda, \lambda_{\mathrm{LASSO}})$ are such that $\lambda \ge \frac{8 \sqrt s}{\kappa(s, 3, \Sigma)} \lambda_{\mathrm{LASSO}}$ and $\lambda_{\mathrm{LASSO}} \ge C K_0 \sqrt{(\log p)/n}$ for $C > 0$ large enough, then, as $n \to \infty$,

(4.9)  $\displaystyle \sup_{t \ge 0} \big| P\{\sqrt n\, \widehat R^{\mathrm{LLA}}_n(1, p) \le t\} - P\{|\widetilde Z|_\infty \le t\} \big| \to 0$,

where $\widetilde Z \stackrel{d}{=} N(0, \Sigma_{22.1})$.

5. Applications to high-dimensional inferences. This section outlines three applications in high-dimensional statistics. The first determines whether discoveries by machine learning and data mining techniques are any better than those reached by chance. Second, we show that the distributions of maximum spurious correlations can also be applied to model selection. In the third application, we validate the fundamental assumption of exogeneity (1.2) in high dimensions.

5.1. Spurious discoveries. Let $q^{\mathrm{CMB}}_\alpha(s, p)$ be the upper $\alpha$-quantile of the random variable $R^{\mathrm{CMB}}_n(s, p)$ defined by (3.12). Then, an approximate $1 - \alpha$ upper confidence limit of the spurious correlation is given by $q^{\mathrm{CMB}}_\alpha(s, p)$. In view of Theorems 3.1 and 3.2, we claim that

(5.1)  $P\{\widehat R_n(s, p) \le q^{\mathrm{CMB}}_\alpha(s, p)\} \to 1 - \alpha$.

To see this, recall that $R^{\mathrm{CMB}}_n = \sqrt n\, R^{\mathrm{MB}}_n / |\xi|_2$ for $\xi = (\xi_1, \ldots, \xi_n)^T$ as in (3.12), and that, given $\{X_i\}_{i=1}^n$, $R^{\mathrm{MB}}_n$ is the supremum of a Gaussian process. Let $F^{\mathrm{MB}}_n(t) = P\{R^{\mathrm{MB}}_n(s, p) \le t \mid X_1, \ldots, X_n\}$ be the (conditional) distribution function of $R^{\mathrm{MB}}_n$ and define $t_0 = \inf\{t : F^{\mathrm{MB}}_n(t) > 0\}$. By Theorem 11.1 of Davydov, Lifshits and Smorodina (1998), $F^{\mathrm{MB}}_n$ is absolutely continuous with respect to the Lebesgue measure and is strictly increasing on $(t_0, \infty)$, indicating that $P\{R^{\mathrm{CMB}}_n \le q^{\mathrm{CMB}}_\alpha(s, p) \mid X_1, \ldots, X_n\} = 1 - \alpha$ almost surely. This, together with (3.3) and (3.11), proves (5.1) under Conditions 2.1 and 2.2, when $s \log(\gamma_s pn) = o(n^{1/7})$.


Let Ŷ_i be the fitted value using s predictors indexed by Ŝ, a set selected by a data-driven technique, and let Y_i be the associated response value. They are denoted in vector form by Ŷ and Y, respectively. If

(5.2)    \widehat{corr}_n(Y, Ŷ) ≤ q^{CMB}_α(s, p),

then the discovery of the variables Ŝ can be regarded as spurious; that is, no better than by chance. Therefore, the multiplier bootstrap quantile q^{CMB}_α(s, p) provides an important critical value and yardstick for judging whether the discovery is spurious, or whether the selected set Ŝ includes too many spurious variables. This yardstick is independent of the method used in the fitting.
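As an informal illustration, the yardstick can be computed by Monte Carlo for the simplest case s = 1: draw i.i.d. N(0, 1) multipliers, treat them as synthetic noise, and record the maximum absolute sample correlation with the covariates. The function names and this specific normalization are ours (one concrete reading of (3.12)); the general case s > 1 would additionally require a search over size-s subsets.

```python
import numpy as np

def cmb_quantile_s1(X, alpha=0.05, B=1000, rng=None):
    # Sketch for s = 1: for each draw, the statistic is the maximum
    # absolute sample correlation between the multipliers xi and the
    # columns of X; return the upper-alpha quantile over B draws.
    rng = np.random.default_rng(rng)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.linalg.norm(Xc, axis=0)   # unit-norm centered columns
    stats = np.empty(B)
    for b in range(B):
        xi = rng.standard_normal(n)
        u = xi - xi.mean()
        u /= np.linalg.norm(u)
        stats[b] = np.max(np.abs(u @ Xs))  # max_j |corr_n(xi, X_j)|
    return np.quantile(stats, 1 - alpha)

def is_spurious(y, yhat, q):
    # Rule (5.2): flag the discovery when the fitted correlation does
    # not beat the bootstrap quantile of the spurious correlation.
    return abs(np.corrcoef(y, yhat)[0, 1]) <= q
```

A fit whose correlation with the observed response falls below the returned quantile would then be flagged by (5.2).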

Remark 5.1. The problem of judging whether a discovery is spurious is intrinsically different from that of testing the global null hypothesis H₀: β* = 0, which itself is an important problem in high-dimensional statistical inference and has been well studied in the literature since the seminal work of Goeman, van de Geer and van Houwelingen (2006). For example, the global null hypothesis H₀: β* = 0 can be rejected by a test; still, the correlation between Y and the variables Ŝ selected by a statistical method can be smaller than the maximum spurious correlation, and we should interpret the findings of Ŝ with caution. We need either more samples or more powerful variable selection methods. This motivates us to derive the distribution of the maximum spurious correlation R̂_n(s, p). This distribution serves as an important benchmark for judging whether the discovery (of s features from p explanatory variables based on a sample of size n) is spurious. The magnitude of R̂_n(s, p) gives statisticians an idea of how big a spurious correlation can be, and therefore an idea of how much the covariates really contribute to the regression for a given sample size.

5.2. Model selection. In the previous section, we considered the reference distribution of the maximum spurious correlation statistic R̂_n(s, p) as a benchmark for judging whether the discovery of s significant variables (among all of the p variables, using a random sample of size n) is impressive, regardless of which variable selection tool is applied. In this section, we show how the distribution of R̂_n(s, p) can be used to select a model. Intuitively, we would like to select a model that fits better than the spurious fit. This limits the candidate sets of models and provides an upper bound on the model size. In our experience, this upper bound itself provides a model selector.

We now use the LASSO as an illustration of the above idea. Owing to spurious correlation, almost all variable selection procedures will, with high probability, select a number of spurious variables, so that the selected model is over-fitted. For example, the LASSO with the regularization parameter selected by cross-validation typically selects a much larger model than necessary, as the bias caused by the ℓ₁ penalty forces the cross-validation procedure to choose a smaller value of λ. Thus, it is important to stop the LASSO path early, and the quantiles of R̂_n(s, p) provide useful guards.

Specifically, consider the LASSO estimator β̂_λ for the sparse linear model (4.1), with ŝ_λ = |supp(β̂_λ)|, where λ > 0 is the regularization parameter. We consider the LASSO solution path with the largest knot λ_ini := |X^T Y|_∞ and the smallest knot λ_cv selected by ten-fold cross-validation. To avoid over-fitting, we propose using q^{CMB}_α as a guide for choosing the regularization parameter, which guards us against selecting too many spurious variables. For each λ in the path, we compute \widehat{corr}_n(Ŷ_λ, Y), the sample correlation between the post-LASSO fitted and observed responses, and q^{CMB}_α(ŝ_λ, p). Let λ̂_α be the largest λ such that \widehat{corr}_n(Ŷ_λ, Y) − q^{CMB}_α(ŝ_λ, p) is nonnegative and flips sign at the subsequent knot. The selected model is then Ŝ_α = supp(β̂_{λ̂_α}). As demonstrated by the simulation studies in Section 6.4, this procedure selects a much smaller model, with size closer to that of the true model.
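The knot-scanning rule just described can be sketched in a few lines. The function name is ours, and the inputs are assumed to be aligned arrays over the LASSO path with the knots in decreasing order; how to handle a path with no sign flip is a design choice (here we fall back to the smallest knot).

```python
import numpy as np

def select_lambda(lambdas, fit_corrs, q_cmb):
    # lambdas: knots in decreasing order; fit_corrs[k] is the post-LASSO
    # fitted correlation at knot k; q_cmb[k] is the bootstrap quantile at
    # that knot's model size.  Return the largest knot at which the
    # difference is nonnegative and flips sign at the next knot.
    diff = np.asarray(fit_corrs, dtype=float) - np.asarray(q_cmb, dtype=float)
    for k in range(len(diff) - 1):
        if diff[k] >= 0 and diff[k + 1] < 0:
            return lambdas[k]
    return lambdas[-1]  # no sign flip along the path: keep the smallest knot
```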

5.3. Validating exogeneity. Fan and Liao (2014) show that the exogeneity condition (1.2) is necessary for penalized least squares to achieve model selection consistency. They question the validity of such an exogeneity assumption, as it imposes too many equations. They argue further that even when the exogenous model holds for the important variables X_S, i.e.,

(5.3)    Y = X_S^T β*_S + ε,   E(ε X_S) = 0,

the extra variables X_N (with N = Sᶜ) are collected in an effort to cover the unknown set S, but no verification of the conditions

(5.4)    E(ε X_N) = E{(Y − X_S^T β*_S) X_N} = 0

has ever been made. The equality E{(Y − X_S^T β*_S) X_j} = 0 in (5.4) may hold by luck for some covariates X_j, but it cannot be expected to hold for all j ∈ N. They propose the focused generalized method of moments (FGMM) to avoid the unreasonable assumption (5.4). Recognizing that (5.3) is not identifiable in high-dimensional linear models, they impose additional conditions such as E(ε X²_S) = 0.

Despite its fundamental importance to high-dimensional statistics, there are no available tools for validating (1.2). Regarding (1.2) as a null hypothesis, an asymptotically α-level test rejects assumption (1.2) when

(5.5)    T̂_{n,p} = max_{j∈[p]} √n \widehat{corr}_n(X_j, ε) ≥ q^{CMB}_α(1, p).

By Theorems 3.1 and 3.2, the test statistic has an approximate size α. The p-value of the test can be computed via the distribution of the Gaussian multiplier process R^{CMB}_n(1, p).

As pointed out in the introduction, when the components of X are weakly correlated, the distribution of the maximum spurious correlation does not depend very sensitively on Σ; see also Lemma 6 in Cai, Liu and Xia (2014). In this case, we can approximate Σ by the identity matrix, and hence compare the renormalized test statistic

(5.6)    J_{n,p} = T̂²_{n,p} − 2 log p + log(log p)

with the limiting distribution in (3.6). The critical value for the test statistic J_{n,p} is

(5.7)    J_α = −2 log{−√π log(1 − α)},

and the associated p-value is given by

(5.8)    1 − exp(−π^{−1/2} e^{−J_{n,p}/2}).

Expressions (5.7) and (5.8) provide analytic forms for a quick validation of the exogeneity assumption (1.2) under weak dependence. In general, we recommend using the wild bootstrap, which takes the correlation effect into account and provides more accurate estimates, especially when the dependence is strong. See Chang et al. (2017) for more empirical evidence.
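The two closed forms are easy to check numerically: by construction, the p-value evaluated at J_α equals α. A minimal sketch (function names ours):

```python
import math

def critical_value(alpha):
    # Equation (5.7): J_alpha = -2 log{ -sqrt(pi) * log(1 - alpha) }
    return -2.0 * math.log(-math.sqrt(math.pi) * math.log(1.0 - alpha))

def p_value(j):
    # Upper-tail p-value implied by the Gumbel-type limit in (3.6):
    # 1 - exp(-pi^{-1/2} exp(-j/2)); at j = J_alpha this equals alpha.
    return 1.0 - math.exp(-math.exp(-j / 2.0) / math.sqrt(math.pi))
```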

In practice, ε is typically unknown to us. Therefore, T̂_{n,p} in (5.5) is calculated using the fitted residuals {ε̂^{LLA}_i}_{i=1}^n. In view of Theorem 4.2, we need to adjust the null distribution according to (4.9). By Theorem 3.2, we adjust the definition of the process Z_n in (3.9) to

(5.9)    Z^{LLA}_n = n^{−1/2} Σ_{i=1}^n ξ_i (X^{LLA}_i − X̄^{LLA}_n) ∈ ℝ^{p−|Ŝ|},

where X^{LLA}_i = X_{i,N̂} − Σ̂_{N̂Ŝ} Σ̂^{−1}_{ŜŜ} X_{i,Ŝ} is the vector of residuals from regressing X_{N̂} on X_{Ŝ}, Ŝ is the set of selected variables, N̂ = [p]\Ŝ, and Σ̂_{SS′} denotes the sub-matrix of Σ̂_n containing the entries indexed by (k, ℓ) ∈ S × S′. From

(5.9), the multiplier bootstrap approximation of |Z̃|_∞ is R^{MB,LLA}_n(1, p) = |D̂^{−1/2} Z^{LLA}_n|_∞, where D̂ is the diagonal matrix of the sample covariance matrix of {X^{LLA}_i}_{i=1}^n. Consequently, we reject (1.2) if T̂_{n,p} > q^{MB,LLA}_α(1, p), where q^{MB,LLA}_α(1, p) is the (conditional) upper α-quantile of R^{MB,LLA}_n(1, p) given {X_i}_{i=1}^n.
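The pieces above can be assembled into a rough sketch of the adjusted test: form the residuals X^{LLA} of the unselected covariates regressed on the selected ones, compute the maximum (scaled) absolute correlation of those residuals with the fitted residuals, and compare it with a multiplier bootstrap quantile of |D̂^{−1/2} Z^{LLA}_n|_∞. The function name, the least-squares implementation of the projection, and the exact bootstrap normalization are ours, not part of the formal development.

```python
import numpy as np

def exogeneity_test(X, resid, S_hat, alpha=0.05, B=1000, rng=None):
    # Sketch: T = max_j sqrt(n)|corr_n(X_j^LLA, resid)| over j in N_hat,
    # compared with a multiplier bootstrap quantile based on (5.9).
    rng = np.random.default_rng(rng)
    n, p = X.shape
    N_hat = np.setdiff1d(np.arange(p), np.asarray(S_hat))
    XS, XN = X[:, np.asarray(S_hat)], X[:, N_hat]
    coef, *_ = np.linalg.lstsq(XS, XN, rcond=None)
    XLLA = XN - XS @ coef                  # residuals of X_N on X_S
    Xc = XLLA - XLLA.mean(axis=0)
    sd = Xc.std(axis=0)                    # Dhat^{1/2}: marginal sds
    r = resid - np.mean(resid)
    # sqrt(n)*corr_n simplifies to (r @ Xc_j) / (|r|_2 * sd_j)
    T = np.max(np.abs(r @ Xc) / (np.linalg.norm(r) * sd))
    stats = np.empty(B)
    for b in range(B):
        xi = rng.standard_normal(n)
        Z = Xc.T @ xi / np.sqrt(n)         # one draw of Z_n^LLA
        stats[b] = np.max(np.abs(Z / sd))
    q = np.quantile(stats, 1 - alpha)
    return T, q, T > q
```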

Remark 5.2.To the best of our knowledge, this is the ﬁrst paper to

consider testing the exogenous assumption (1.2), for which we use the maxi-

mum correlation between covariates and ﬁtted residuals as the test statistic.

A referee kindly informed us in his/her review report that in the context

of speciﬁcation testing, Chernozhukov, Chetverikov and Kato (2013) pro-

pose a similar extreme value statistic and use the multiplier bootstrap to

compute a critical value for the test. To construct marginal test statistics,

they use self-normalized covariances between generated regressors and ﬁtted

residuals obtained via ordinary least squares, whereas we use sample cor-

relations between the covariates and ﬁtted residuals obtained by the LLA

algorithm. We refer readers to Appendix M in the supplementary material

of Chernozhukov, Chetverikov and Kato (2013) for more details.

6. Numerical studies. In this section, Monte Carlo simulations are

used to examine the ﬁnite-sample performance of the bootstrap approxi-

mation (for a given data set) of the distribution of the maximum spurious

correlation (MSC).

6.1. Computation of spurious correlation. First, we observe that R̂_n(s, p) in (2.2) can be written as

R̂²_n(s, p) = σ̂^{−2}_ε max_{S⊆[p]: |S|=s} v_{n,S}^T Σ̂^{−1}_{SS} v_{n,S},

where σ̂²_ε = n^{−1} Σ_{i=1}^n (ε_i − ε̄_n)² and v_n = n^{−1} Σ_{i=1}^n (ε_i − ε̄_n)(X_i − X̄_n). Therefore, the computation of R̂_n(s, p) requires solving the combinatorial optimization problem

(6.1)    Ŝ = arg max_{S⊆[p]: |S|=s} v_{n,S}^T Σ̂^{−1}_{SS} v_{n,S}.

It is computationally intensive to obtain Ŝ for large values of p and s, as one essentially needs to enumerate all (p choose s) subsets of size s from the p covariates. A fast and easily implementable approach is the stepwise addition (forward selection) algorithm as in Fan, Guo and Hao (2012), which yields a value that is no larger than R̂_n(s, p) but avoids computing all (p choose s) multiple correlations in (6.1). Note that the optimization (6.1) is equivalent to finding the best subset regression of size s. When p is relatively small, say between 20 and 40, the branch-and-bound procedure is commonly used to find the best subset of a given size that maximizes the multiple R² [Brusco and Stahl (2005)]. However, this approach becomes computationally infeasible very quickly when there are hundreds or thousands of potential predictors. As a trade-off between approximation accuracy and computational cost, we propose a two-step procedure that combines the stepwise addition and branch-and-bound algorithms. First, we use forward selection to pick the best d variables, say d = 40, which serves as a pre-screening step. Second, among the (d choose s) subsets of size s, the branch-and-bound procedure is implemented to select the best subset that maximizes the multiple R². This subset is used as an approximate solution to (6.1). Note that when s > 40, which is rare in many applications, we only use stepwise addition to reduce the computational cost.
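The pre-screening step can be sketched as a greedy search that, at each stage, adds the covariate giving the largest increase in the multiple R² between the noise vector and the selected columns. This is a sketch of stepwise addition only (the branch-and-bound refinement is omitted), and the function name is ours.

```python
import numpy as np

def forward_max_r2(X, eps, s):
    # Greedy stepwise addition: repeatedly add the covariate whose
    # inclusion maximizes the R^2 of regressing eps on the selected
    # columns.  Returns the index set and sqrt(R^2), a lower bound for
    # the maximum s-variate spurious correlation.
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    e = eps - eps.mean()
    selected, best_r2 = [], 0.0
    for _ in range(s):
        best_j = None
        for j in range(p):
            if j in selected:
                continue
            Xs = Xc[:, selected + [j]]
            beta, *_ = np.linalg.lstsq(Xs, e, rcond=None)
            fit = Xs @ beta
            r2 = (fit @ fit) / (e @ e)   # R^2: fit and e are centered
            if best_j is None or r2 > best_r2:
                best_j, best_r2 = j, r2
        selected.append(best_j)
    return selected, float(np.sqrt(best_r2))
```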

6.2. Accuracy of the multiplier bootstrap approximation. For the first simulation, we consider the case where the random noise ε follows the uniform distribution, standardized so that E(ε) = 0 and E(ε²) = 1. Independent of ε, the p-variate vector X of covariates has i.i.d. N(0, 1) components. In the results reported in Table 1, the ambient dimension is p = 2000, the sample size n takes values in {400, 800, 1200}, and s takes values in {1, 2, 5, 10}. For a given significance level α ∈ (0, 1), let q_α(s, p) be the upper α-quantile of R̂_n(s, p) in (2.1). For each data set X_n = {X₁, …, X_n}, a direct application of Theorems 3.1 and 3.2 gives

c_MB(X_n, α) := P{R^{CMB}_n(s, p) ≥ q_α(s, p) | X_n} → α   as n → ∞.

The difference c_MB(X_n, α) − α therefore characterizes the extent of the size distortion and the finite-sample accuracy of the multiplier bootstrap approximation (MBA). Table 1 summarizes the mean and the standard deviation (SD) of c_MB(X_n, α) based on 200 simulated data sets with α ∈ {0.05, 0.1}. The quantile q_α(s, p) is calculated from 1600 replications, and c_MB(X_n, α) for each data set is simulated based on 1600 bootstrap replications. In addition, we report in Figure 3 the distributions of the maximum spurious correlations and their multiplier bootstrap approximations, conditional on a given data set X_n, when p ∈ {2000, 5000}, s ∈ {1, 2, 5, 10}, and n = 400. Together, Table 1 and Figure 3 show that the multiplier bootstrap method indeed provides quite a good approximation to the (unknown) distribution of the maximum spurious correlation.
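For s = 1, the conditional probability c_MB(X_n, α) can be estimated by straightforward Monte Carlo: draw multipliers and count how often the bootstrap statistic reaches the reference quantile q_α. A sketch, where the function name and the max-absolute-correlation form of the s = 1 statistic are ours:

```python
import numpy as np

def empirical_size(X, q_alpha, B=1600, rng=None):
    # Estimate c_MB(X_n, alpha) for s = 1: the fraction of multiplier
    # draws whose maximum absolute sample correlation with the columns
    # of X is at least the reference quantile q_alpha.
    rng = np.random.default_rng(rng)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.linalg.norm(Xc, axis=0)
    count = 0
    for b in range(B):
        xi = rng.standard_normal(n)
        u = xi - xi.mean()
        u /= np.linalg.norm(u)
        if np.max(np.abs(u @ Xs)) >= q_alpha:
            count += 1
    return count / B
```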

For the second simulation, we focus on an anisotropic case, where the covariance matrix Σ of X is non-identity and the condition number of Σ is well controlled. Specifically, we assume that ε follows the centered Laplace distribution, rescaled so that E(ε) = 0 and E(ε²) = 1. To introduce dependence among the covariates, we first let A be a 10 × 10 symmetric positive definite matrix with a pre-specified condition number c > 1


Table 1
(Isotropic case) The mean of 200 empirical sizes c_MB(·, α) × 100, with its estimate of SD in parentheses, when p = 2000, s = 1, 2, 5, 10, n = 400, 800, 1200, and α = 0.1, 0.05

            s = 1               s = 2               s = 5               s = 10
   n     α=0.1   α=0.05      α=0.1   α=0.05      α=0.1   α=0.05      α=0.1   α=0.05
  400     9.54    4.68        9.13    4.38        9.08    3.78        8.67    4.44
         (0.643) (0.294)     (0.568) (0.284)     (0.480) (0.245)     (0.506) (0.291)
  800     9.43    4.93        9.47    4.42        9.73    4.73        9.94    5.62
         (0.444) (0.296)     (0.474) (0.296)     (0.488) (0.294)     (0.557) (0.331)
 1200     9.09    4.32        9.00    4.46        9.42    4.87        9.97    5.15
         (0.507) (0.261)     (0.542) (0.278)     (0.543) (0.322)     (0.579) (0.318)

[Figure 3 here: density plots comparing MSC and MBA for panels (n, p) = (400, 2000) and (n, p) = (400, 5000), with curves for s = 1, 2, 5, 10.]

Fig 3: Distributions of maximum spurious correlations (blue) and multiplier bootstrap approximations (for a given data set; red) based on 1600 simulations, with combinations of p = 2000, 5000, s = 1, 2, 5, 10, and n = 400, when Σ is an identity matrix.

and let ρ ∈ (0, 1). Then the p-dimensional vector X of covariates is generated according to X = G₁(ρ)Z₁ + G₂(ρ)Z₂, where Z₁ =_d N(0, A), Z₂ =_d N(0, I_{p−10}), and G₁(ρ) ∈ ℝ^{p×10}, G₂(ρ) ∈ ℝ^{p×(p−10)} are given respectively by G₁(ρ)^T = (I₁₀, ρ(1+ρ²)^{−1/2} I₁₀, G₁₁(ρ)^T) with

G₁₁(ρ) = (1−ρ)/√{1+(1−ρ)²} · [ 1 0 ⋯ 0 ; 1 0 ⋯ 0 ; ⋮ ; 1 0 ⋯ 0 ] ∈ ℝ^{(p−20)×10},

that is, G₁₁(ρ) has every entry of its first column equal to (1−ρ)/√{1+(1−ρ)²} and all other entries zero, and

G₂(ρ) = [ 0_{10×10}, 0_{10×(p−20)} ; (1+ρ²)^{−1/2} I₁₀, 0_{10×(p−20)} ; 0_{(p−20)×10}, {1+(1−ρ)²}^{−1/2} I_{p−20} ].
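This design can be simulated directly from the block structure above. A sketch, under our reading of that structure; the function name is ours, and A may be any 10 × 10 positive definite matrix:

```python
import numpy as np

def sample_anisotropic(n, p, A, rho, rng=None):
    # X = G1(rho) Z1 + G2(rho) Z2 with Z1 ~ N(0, A) (10-dim) and
    # Z2 ~ N(0, I_{p-10}), following the block definitions of G1, G2.
    rng = np.random.default_rng(rng)
    c1 = rho / np.sqrt(1.0 + rho ** 2)
    c2 = (1.0 - rho) / np.sqrt(1.0 + (1.0 - rho) ** 2)
    G1 = np.zeros((p, 10))
    G1[:10, :] = np.eye(10)                 # identity block
    G1[10:20, :] = c1 * np.eye(10)          # rho-scaled identity block
    G1[20:, 0] = c2                         # G11: constant first column
    G2 = np.zeros((p, p - 10))
    G2[10:20, :10] = np.eye(10) / np.sqrt(1.0 + rho ** 2)
    G2[20:, 10:] = np.eye(p - 20) / np.sqrt(1.0 + (1.0 - rho) ** 2)
    L = np.linalg.cholesky(A)               # so that Z1 has covariance A
    Z1 = rng.standard_normal((n, 10)) @ L.T
    Z2 = rng.standard_normal((n, p - 10))
    return Z1 @ G1.T + Z2 @ G2.T
```

By construction, coordinates 11 through 20 have unit variance (c1² + 1/(1+ρ²) = 1); the variances of the remaining coordinates depend on the diagonal of A.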

In particular, we take c = 5 and ρ = 0.8 in the simulations reported in Table 2, which summarizes the mean and the standard deviation (SD) of the size c_MB(X_n, α) based on 200 simulated data sets with α ∈ {0.05, 0.1}. Comparing the simulation results shown in Tables 1 and 2, we find that the bootstrap approximation is fairly robust against heterogeneity in the covariance structure of the covariates.

Table 2
(Anisotropic case) Mean of 200 empirical sizes c_MB(·, α) × 100, with its estimate of SD in parentheses, when p = 2000, s = 1, 2, 5, 10, n = 400, 800, 1200, and α = 0.1, 0.05

            s = 1               s = 2               s = 5               s = 10
   n     α=0.1   α=0.05      α=0.1   α=0.05      α=0.1   α=0.05      α=0.1   α=0.05
  400     9.83    4.39        9.04    4.75        9.27    4.65        9.34    4.53
         (0.426) (0.222)     (0.402) (0.208)     (0.492) (0.273)     (0.557) (0.291)
  800    10.18    5.19       10.48    5.12        9.98    4.86        9.21    4.73
         (0.556) (0.296)     (0.519) (0.272)     (0.576) (0.220)     (0.474) (0.269)
 1200     9.42    4.41        9.60    5.71        9.90    4.85       10.11    5.19
         (0.500) (0.233)     (0.543) (0.339)     (0.553) (0.333)     (0.606) (0.337)

6.3. Detecting spurious discoveries. To examine how the multiplier bootstrap quantile q^{CMB}_α(s, p) (see Section 5.1) serves as a benchmark for judging whether a discovery is spurious, we compute the spurious discovery probability (SDP) by simulating 200 data sets from (4.1) with n = 100, 120, 160, p = 400, β* = (1, 0, −0.8, 0, 0.6, 0, −0.4, 0, …, 0)^T ∈ ℝ^p, and standard Gaussian noise ε =_d N(0, 1). For some integer s ≤ r ≤ p, we let x =_d N(0, I_r) be an r-dimensional Gaussian random vector. Let Γ_r be a p × r matrix satisfying Γ_r^T Γ_r = I_r. The rows of the design matrix X are sampled as i.i.d. copies of Γ_r x ∈ ℝ^p, where r takes values in {120, 160, 200, 240, 280, 320, 360}. To save space, we give the numerical results for the case of non-Gaussian design and noise in the supplementary material [Fan, Shao and Zhou (2017)].

Put Y = (Y₁, …, Y_n)^T and let Ŷ = X_Ŝ β̂_pLASSO be the n-dimensional vector of fitted values, where β̂_pLASSO = (X_Ŝ^T X_Ŝ)^{−1} X_Ŝ^T Y is the post-LASSO estimator using the covariates selected by the ten-fold cross-validated LASSO estimator. Let ŝ = |Ŝ| be the number of variables selected. For α ∈ (0, 1), the level-α SDP is defined as P{|\widehat{corr}_n(Y, Ŷ)| ≤ q^{CMB}_α(ŝ, p)}. As the simulated model is not null, this SDP is indeed a type II error. Given α = 5% and for each simulated data set, q^{CMB}_α(ŝ, p) is computed based on 1000 bootstrap replications. We then compute the empirical SDP based on 200 simulations. The results are given in Table 3.

In this study, the design matrix is chosen so that there is a low-dimensional linear dependency among the high-dimensional covariates. The collected covariates are highly correlated when r is much smaller than p. It is known that collinearity and high dimensionality add difficulty to the problem of variable selection and deteriorate the performance of the LASSO. The smaller r is, the more severe the collinearity problem becomes. As reflected in Table 3, the empirical SDP increases as r decreases, indicating that the correlation between the fitted and observed responses is more likely to be smaller than the spurious correlation.

Table 3
Empirical α-level spurious discovery probability (ESDP) based on 200 simulations, when p = 400, n = 100, 120, 160, and α = 5%

     n     r=120   r=160   r=200   r=240   r=280   r=320   r=360
    100   0.6950  0.6650  0.6000  0.5200  0.5100  0.4500  0.4000
    120   0.6600  0.5350  0.3350  0.3800  0.2500  0.2850  0.1950
    160   0.1950  0.1300  0.0500  0.0400  0.0550  0.0700  0.0250

6.4. Model selection. We demonstrate the idea in Section 5.2 through the following toy example. Consider the linear model (4.1) with (n, p) = (160, 400) and β* = (1, 0, −0.8, 0, 0.6, 0, −0.4, 0, …, 0)^T ∈ ℝ^p. The covariate vector is taken to be X = Γx with x = (x₁, …, x₂₀₀)^T, where x₁